Cascaded Cross-Modal Transformer for Audio-Textual Classification (2401.07575v2)

Published 15 Jan 2024 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Speech classification tasks often require powerful language understanding models to grasp useful features, which becomes problematic when limited training data is available. To attain superior classification performance, we propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models and translating the transcripts into different languages via pretrained translation models. We thus obtain an audio-textual (multimodal) representation for each data sample. Subsequently, we combine language-specific Bidirectional Encoder Representations from Transformers (BERT) with Wav2Vec2.0 audio features via a novel cascaded cross-modal transformer (CCMT). Our model is based on two cascaded transformer blocks. The first one combines text-specific features from distinct languages, while the second one combines acoustic features with multilingual features previously learned by the first transformer block. We employed our system in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge. CCMT was declared the winning solution, obtaining an unweighted average recall (UAR) of 65.41% and 85.87% for complaint and request detection, respectively. Moreover, we applied our framework to the Speech Commands v2 and HarperValleyBank dialog data sets, surpassing previous studies reporting results on these benchmarks. Our code is freely available for download at: https://github.com/ristea/ccmt.
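The two-stage cascade described in the abstract maps naturally onto two stacked transformer encoders: one that fuses per-language text embeddings into a multilingual representation, and one that fuses audio features with that representation. The PyTorch sketch below illustrates this structure under stated assumptions: the dimensions, layer counts, projection layers, and mean pooling are illustrative placeholders rather than the paper's actual values, and the class and argument names are invented for this example; the authors' real implementation is in the repository linked above.

```python
import torch
import torch.nn as nn

class CascadedCrossModalTransformer(nn.Module):
    """Minimal sketch of the CCMT idea: stage 1 fuses text embeddings
    from several languages; stage 2 fuses audio features with the
    resulting multilingual representation. Hyperparameters are
    illustrative, not the values used in the paper."""

    def __init__(self, text_dim=768, audio_dim=768, d_model=256,
                 num_heads=4, num_layers=2, num_classes=2):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Stage 1: combines token sequences from all languages.
        self.text_fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True),
            num_layers)
        # Stage 2: combines audio tokens with the multilingual output.
        self.cross_fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True),
            num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, text_feats, audio_feats):
        # text_feats: list of (batch, seq_len_i, text_dim) tensors, one per
        # language, e.g. from language-specific BERTs on translated transcripts.
        # audio_feats: (batch, audio_len, audio_dim), e.g. Wav2Vec2.0 outputs.
        text_tokens = torch.cat([self.text_proj(t) for t in text_feats], dim=1)
        multilingual = self.text_fusion(text_tokens)          # stage 1
        fused = torch.cat([self.audio_proj(audio_feats), multilingual], dim=1)
        fused = self.cross_fusion(fused)                      # stage 2
        return self.classifier(fused.mean(dim=1))             # pooled logits

# Example: two languages, a batch of 4 samples, binary classification.
texts = [torch.randn(4, 16, 768), torch.randn(4, 20, 768)]
audio = torch.randn(4, 50, 768)
logits = CascadedCrossModalTransformer()(texts, audio)
print(logits.shape)  # torch.Size([4, 2])
```

The key design point is the ordering of the cascade: the text-only fusion runs first, so the second block attends over acoustic tokens jointly with an already-consolidated multilingual summary rather than with raw per-language sequences.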

References (61)
  1. Martin, L., Muller, B., Suárez, P.J.O., Dupont, Y., Romary, L., La Clergerie, É.V., Seddah, D., Sagot, B.: CamemBERT: a Tasty French Language Model. In: Proceedings of ACL, pp. 7203–7219 (2020) Chung et al. [2022] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022) Devlin et al. [2019] Devlin, J., Chang, M.-W., Lee, K., Toutanova, L.K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019) Akbari et al. [2021] Akbari, H., Yuan, L., Qian, R., Chuang, W.-H., Chang, S.-F., Cui, Y., Gong, B.: VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Proceedings of NeurIPS, vol. 34, pp. 24206–24221 (2021) Das and Singh [2023] Das, R., Singh, T.D.: Multimodal sentiment analysis: A survey of methods, trends and challenges. ACM Computing Surveys (2023) Georgescu et al. [2023] Georgescu, M.-I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of ICCV, pp. 16144–16154 (2023) Jabeen et al. [2023] Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. 
IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. 
ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022) Devlin et al. [2019] Devlin, J., Chang, M.-W., Lee, K., Toutanova, L.K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019) Akbari et al. [2021] Akbari, H., Yuan, L., Qian, R., Chuang, W.-H., Chang, S.-F., Cui, Y., Gong, B.: VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Proceedings of NeurIPS, vol. 34, pp. 24206–24221 (2021) Das and Singh [2023] Das, R., Singh, T.D.: Multimodal sentiment analysis: A survey of methods, trends and challenges. ACM Computing Surveys (2023) Georgescu et al. [2023] Georgescu, M.-I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of ICCV, pp. 16144–16154 (2023) Jabeen et al. [2023] Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. 
In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. 
[2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Devlin, J., Chang, M.-W., Lee, K., Toutanova, L.K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019) Akbari et al. 
[2021] Akbari, H., Yuan, L., Qian, R., Chuang, W.-H., Chang, S.-F., Cui, Y., Gong, B.: VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Proceedings of NeurIPS, vol. 34, pp. 24206–24221 (2021) Das and Singh [2023] Das, R., Singh, T.D.: Multimodal sentiment analysis: A survey of methods, trends and challenges. ACM Computing Surveys (2023) Georgescu et al. [2023] Georgescu, M.-I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of ICCV, pp. 16144–16154 (2023) Jabeen et al. [2023] Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. 
[2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Akbari, H., Yuan, L., Qian, R., Chuang, W.-H., Chang, S.-F., Cui, Y., Gong, B.: VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Proceedings of NeurIPS, vol. 34, pp. 24206–24221 (2021) Das and Singh [2023] Das, R., Singh, T.D.: Multimodal sentiment analysis: A survey of methods, trends and challenges. ACM Computing Surveys (2023) Georgescu et al. [2023] Georgescu, M.-I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of ICCV, pp. 16144–16154 (2023) Jabeen et al. [2023] Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. 
[2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Das, R., Singh, T.D.: Multimodal sentiment analysis: A survey of methods, trends and challenges. ACM Computing Surveys (2023) Georgescu et al. [2023] Georgescu, M.-I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of ICCV, pp. 16144–16154 (2023) Jabeen et al. [2023] Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. 
In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. 
[2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at Once - Multi-modal Fusion Transformer for Video Retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. 
[2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. 
[2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. 
[2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. 
[2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. 
[2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. 
In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. 
ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. 
[2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 
20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 
2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. 
[2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 
In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. 
[2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. 
In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. 
Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. 
Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2015)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. In: Proceedings of NeurIPS, vol. 35, pp. 28708–28720 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. 
[2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. 
In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. 
[2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 
34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. 
[2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. 
[2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. In: Proceedings of NeurIPS, vol. 35, pp. 28708–28720 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. 
[2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. 
[2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. 
[2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. 
In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023)
Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
Dumitrescu et al. [2020] Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
He et al.
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 
20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997)
18. Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020)
19. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
21. Ristea, N.-C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. 
[2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. 
[2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. 
[2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. 
[2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. 
[2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. 
[2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. 
The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. 
[2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  5. Das, R., Singh, T.D.: Multimodal sentiment analysis: A survey of methods, trends and challenges. ACM Computing Surveys (2023) Georgescu et al. [2023] Georgescu, M.-I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of ICCV, pp. 16144–16154 (2023) Jabeen et al. [2023] Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. 
[2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
Dumitrescu et al. [2020] Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. 
[2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 
In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. 
[2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. 
Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  6. Georgescu, M.-I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of ICCV, pp. 16144–16154 (2023) Jabeen et al. [2023] Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. 
[2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. 
In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. 
[2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? 
Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. 
Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. 
[2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. 
[2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. 
[2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. 
[2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. 
In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. 
In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. 
ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  7. Jabeen, S., Li, X., Amin, M.S., Bourahla, O., Li, S., Jabbar, A.: A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 19(2s), 1–41 (2023) Yoon et al. [2018] Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. 
[2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE Radford et al. [2022] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. 
[2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 
4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. 
The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 
10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. 
[2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. 
Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022)
Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
Dumitrescu et al. [2020] Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
8. Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: Proceedings of SLT Workshop, pp. 112–118 (2018). IEEE
9. Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022)
10. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020)
11. Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017)
12. Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020)
13. Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022)
14. Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023)
15. Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023)
16. Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and Bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023)
17. Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019)
18. Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020)
19. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
21. Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once: multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
arXiv preprint arXiv:1907.11692 (2019) Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. 
[2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. 
Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. 
Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 
9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. 
[2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. 
[2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 
4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. 
In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  9. Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022) Cañete et al. [2020] Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 
In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC (ICLR Workshop) (2020) Ramachandram and Taylor [2017] Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. 
IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. 
In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. 
[2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. 
[2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. 
[2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 
In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. 
In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023)
Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
Dumitrescu et al. [2020] Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 
3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  11. Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017) Gao et al. [2020] Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023) Schuller et al. [2023] Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023) Porjazovski et al. [2023] Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023) Purwins et al. [2019] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. 
ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. 
Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020) Stahlschmidt et al. [2022] Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022) Ristea and Ionescu [2023] Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 
arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. 
In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
  12. Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020)
  13. Stahlschmidt, S.R., Ulfenborg, B., Synnergren, J.: Multimodal deep learning for biomedical data fusion: a review. Briefings in Bioinformatics 23(2), 569 (2022)
  14. Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023)
  15. Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023)
  16. Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and Bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023)
  17. Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019)
  18. Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020)
  19. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
  20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
  21. Ristea, N.-C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
  22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017)
  23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
  24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. In: Proceedings of NeurIPS, vol. 35, pp. 28708–28720 (2022)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
  26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
  27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
  28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
  29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
  30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
  31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022)
  32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
  33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
  34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
  35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
  36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
  38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
  39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020)
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022)
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022)
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022)
  59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
  60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 
10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. 
[2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. 
Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 
In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. 
In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. 
Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. 
[2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). 
IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 
3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. 
ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  14. Ristea, N.-C., Ionescu, R.T.: Cascaded Cross-Modal Transformer for Request and Complaint Detection. In: Proceedings of ACMMM, pp. 9467–9471 (2023)
  15. Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023)
  16. Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and Bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023)
  17. Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019)
  18. Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020)
  19. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
  20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
  21. Ristea, N.-C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
  22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
  23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
  24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
  26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
  27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
  28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
  29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
  30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
  31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
  32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
  33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
  34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
  35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
  36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
  38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
  39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
  59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
  60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. 
[2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. 
[2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. 
In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 
20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
  15. Schuller, B.W., Batliner, A., Amiriparian, S., Barnhill, A., Gerczuk, M., Triantafyllopoulos, A., Baird, A., Tzirakis, P., Gagne, C., Cowen, A.S., Lackovic, N., Caraty, M.-J., Montacié, C.: The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests. In: Proceedings of ACMMM, pp. 9635–9639 (2023)
  16. Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and Bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023)
  17. Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019)
  18. Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020)
  19. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
  20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
  21. Ristea, N.-C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
  22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
  23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
  24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
  26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
  27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
  28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
  29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
  30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
  31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
  32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
  33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
  34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
  35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
  36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
  38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
  39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
  59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2015)
  60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. 
NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. 
Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
16. Porjazovski, D., Getman, Y., Grósz, T., Kurimo, M.: Advancing audio emotion and intent recognition with large pre-trained models and Bayesian inference. In: Proceedings of ACMMM, pp. 9477–9481 (2023)
17. Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019)
18. Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020)
19. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
21. Ristea, N.-C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. In: Proceedings of NeurIPS, vol. 35, pp. 28708–28720 (2022)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 
In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. 
In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  17. Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S.-Y., Sainath, T.: Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13(2), 206–219 (2019) Ristea and Ionescu [2020] Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. 
[2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020) Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 
In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021) Ristea et al. [2022] Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. 
[2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. 
[2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. 
[2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 
20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  18. Ristea, N.-C., Ionescu, R.T.: Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs. In: Proceedings of INTERSPEECH, pp. 2102–2106 (2020)
  19. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
  20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
  21. Ristea, N.-C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
  22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
  23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
  24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. In: Proceedings of NeurIPS, vol. 35, pp. 28708–28720 (2022)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
  26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
  27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
  28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
  29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
  30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
  31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
  32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
  33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
  34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
  35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
  36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
  38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
  39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
  59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
  60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. 
The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. 
[2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
19. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
21. Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022) Gemmeke et al. [2017] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. 
[2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE Gong et al. [2022] Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. 
In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. 
[2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. 
In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. 
In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 
20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
20. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proceedings of INTERSPEECH, pp. 571–575 (2021)
21. Ristea, N.C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. In: Proceedings of NeurIPS, vol. 35, pp. 28708–28720 (2022)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 
4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. 
In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 
3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
  21. Ristea, N.-C., Ionescu, R.T., Khan, F.: SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH, pp. 4103–4107 (2022)
  22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
  23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
  24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. In: Proceedings of NeurIPS, vol. 35, pp. 28708–28720 (2022)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
  26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
  27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
  28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
  29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
  30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
  31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
  32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
  33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
  34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
  35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
  36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
  38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
  39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
  59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2015)
  60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022) Huang et al. [2022] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. 
Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. 
[2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. 
[2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. 
[2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. 
The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. 
[2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  22. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proceedings of ICASSP, pp. 776–780 (2017). IEEE
  23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
  24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
  26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
  27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
  28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
  29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
  30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
  31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
  32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
  33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
  34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
  35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
  36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
  38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
  39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
  59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
  60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 
3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
23. Gong, Y., Lai, C.-I., Chung, Y.-A., Glass, J.: SSAST: Self-Supervised Audio Spectrogram Transformer. In: Proceedings of AAAI, vol. 36, pp. 10699–10709 (2022)
24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. In: Proceedings of NeurIPS, vol. 35, pp. 28708–28720 (2022)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. 
[2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
24. Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked autoencoders that listen. Proceedings of NeurIPS 35, 28708–28720 (2022)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017)
27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. 
Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. 
[2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. 
ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. 
Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of ICLR (2021) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. 
[2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. 
Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. 
[2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
[2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of NIPS, pp. 5998–6008 (2017) Minaee et al. [2021] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021) Gasparetto et al. [2022] Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. 
In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. 
[2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. 
In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
27. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Computing Surveys 54(3), 1–40 (2021)
28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. 
[2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. 
[2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  28. Gasparetto, A., Marcuzzo, M., Zangari, A., Albarelli, A.: A survey on text classification algorithms: From text to predictions. Information 13(2), 83 (2022) Khadhraoui et al. [2022] Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE Dumitrescu et al. [2020] Dumitrescu, Ştefan Daniel., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020) Bhaskar et al. [2015] Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015) Yoon et al. [2020] Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. 
[2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022) Wan and Li [2022] Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022) Yang et al. [2022] Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
29. Khadhraoui, M., Bellaaj, H., Ammar, M.B., Hamam, H., Jmaiel, M.: Survey of BERT-base models for scientific text classification: COVID-19 case study. Applied Sciences 12(6), 2891 (2022)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020) Abdu et al. [2021] Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. 
Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021) Pandeya et al. [2021] Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. 
[2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 
34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. 
Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
30. Wan, C.-X., Li, B.: Financial causal sentence recognition based on BERT-CNN text classification. The Journal of Supercomputing, 1–25 (2022)
31. Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. 
[2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. 
[2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. 
arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 
arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). 
IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
Yang, C.-H.H., Qi, J., Chen, S.Y.-C., Tsao, Y., Chen, P.-Y.: When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing. In: Proceedings of ICASSP, pp. 8602–8606 (2022). IEEE
Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. 
[2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
32. Dumitrescu, Ş.D., Avram, A.-M., Pyysalo, S.: The birth of Romanian BERT. In: Proceedings of EMNLP, pp. 4324–4328 (2020)
33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021) Sun et al. [2020] Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. 
[2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. 
[2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. 
In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. 
In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 
2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. 
In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  33. Bhaskar, J., Sruthi, K., Nedungadi, P.: Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science 46, 635–643 (2015)
  34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
  35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
  36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
  38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
  39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
  59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
  60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  34. Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39(6), 1–16 (2020)
  35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
  36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
  38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
  39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022).
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. 
arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. 
[2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. 
[2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. 
In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. 
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
35. Abdu, S.A., Yousef, A.H., Salem, A.: Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204–226 (2021)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
36. Pandeya, Y.R., Bhattarai, B., Lee, J.: Deep-learning-based multimodal emotion classification for music videos. Sensors 21(14), 4927 (2021)
37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once - multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. 
[2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. 
[2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  37. Sun, Z., Sarma, P., Sethares, W., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: Proceedings of AAAI, vol. 34, pp. 8992–8999 (2020) Singh et al. [2021] Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. 
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021) Toto et al. [2021] Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. 
[2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021) Boulahia et al. [2021] Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. [2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. 
In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021) Huang et al. 
[2020] Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020) Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020) Huang et al. [2020] Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE Li et al. [2023] Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023) Pawłowski et al. [2023] Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023) Xu et al. [2023] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) Shvetsova et al. [2022] Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022) Lee et al. [2022] Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer Liu et al. [2023] Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
38. Singh, P., Srivastava, R., Rana, K.P.S., Kumar, V.: A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowledge-Based Systems 229, 107316 (2021)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
39. Toto, E., Tlachac, M.L., Rundensteiner, E.A.: AudiBERT: A deep transfer learning multimodal classification framework for depression screening. In: Proceedings of CIKM, pp. 4145–4154 (2021)
40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. In: Proceedings of NeurIPS, vol. 33, pp. 4835–4845 (2020)
43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once – multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)
54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020)
55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023)
56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020)
57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE
58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE
59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014)
60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  40. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32(6), 121 (2021)
  41. Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digital Medicine 3(1), 136 (2020)
  42. Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Proceedings of NeurIPS 33, 4835–4845 (2020)
[2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023) Lackovic et al. [2022] Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022) Warden [2018] Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018) Wu et al. [2020] Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. 
[2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. 
In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. 
[2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  43. Huang, J., Tao, J., Liu, B., Lian, Z., Niu, M.: Multimodal transformer fusion for continuous emotion recognition. In: Proceedings of ICASSP, pp. 3507–3511 (2020). IEEE
  44. Li, Y., Quan, R., Zhu, L., Yang, Y.: Efficient Multimodal Fusion via Interactive Prompting. In: Proceedings of CVPR, pp. 2604–2613 (2023)
  45. Pawłowski, M., Wróblewska, A., Sysko-Romańczuk, S.: Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23(5), 2381 (2023)
  46. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  47. Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R.S., Harwath, D., Glass, J., Kuehne, H.: Everything at once-multi-modal fusion transformer for video retrieval. In: Proceedings of CVPR, pp. 20020–20029 (2022)
  48. Lee, W.-Y., Jovanov, L., Philips, W.: Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Proceedings of ECCV, pp. 608–623 (2022). Springer
In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  49. Liu, Z., Cheng, Q., Song, C., Cheng, J.: Cross-scale cascade transformer for multimodal human action recognition. Pattern Recognition Letters 168, 17–23 (2023)
  50. Lackovic, N., Montacié, C., Lalande, G., Caraty, M.-J.: Prediction of user request and complaint in spoken customer-agent conversations. arXiv preprint arXiv:2208.10249 (2022)
  51. Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209 (2018)
  52. Wu, M., Nafziger, J., Scodary, A., Maas, A.: HarperValleyBank: A domain-specific spoken dialog corpus. arXiv preprint arXiv:2010.13929 (2020)
  53. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016) Le et al. [2020] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 
9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). 
IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  54. Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of LREC, pp. 2479–2490 (2020) Sun et al. [2023] Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. 
In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  55. Sun, Y., Xu, K., Liu, C., Dou, Y., Qian, K.: Automatic audio augmentation for requests sub-challenge. In: Proceedings of ACMMM, pp. 9482–9486 (2023) Majumdar and Ginsburg [2020] Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  56. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In: Proceedings of INTERSPEECH, pp. 3356–3360 (2020) Thomas et al. [2022] Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  57. Thomas, S., Kuo, H.-K.J., Kingsbury, B., Saon, G.: Towards reducing the need for speech training data to build spoken language understanding systems. In: Proceedings of ICASSP, pp. 7932–7936 (2022). IEEE Sunder et al. [2022] Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  58. Sunder, V., Thomas, S., Kuo, H.-K.J., Ganhotra, J., Kingsbury, B., Fosler-Lussier, E.: Towards end-to-end integration of dialog history for improved spoken language understanding. In: Proceedings of ICASSP, pp. 7497–7501 (2022). IEEE Kingma and Ba [2014] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  59. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR (2014) Hochreiter and Schmidhuber [1997] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  60. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) Liu et al. [2019] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)
  61. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)