Listen As You Wish: Audio-Based Event Detection via Text-to-Audio Grounding in Smart Cities
Abstract: With the development of Internet of Things technologies, tremendous amounts of audio sensor data have been produced, posing great challenges to audio-based event detection in smart cities. In this paper, we target a challenging audio-based event detection task, namely text-to-audio grounding. Beyond precisely localizing all of the desired onsets and offsets in an untrimmed audio clip, this new task requires extensive acoustic and linguistic comprehension as well as reasoning about the cross-modal matching relations between the audio and the query. Current approaches typically address these issues by treating the query as a whole, encoding it into a single global representation. We contend that this strategy has several drawbacks. First, the interactions between the query and the audio are not fully exploited. Second, it does not distinguish the importance of different keywords in a query. Moreover, since audio clips are of arbitrary length, many segments are irrelevant to the query yet are never filtered out, which further hinders the grounding of the desired segments. Motivated by these concerns, we propose a novel Cross-modal Graph Interaction (CGI) model that comprehensively models the relations among the words in a query through a novel language graph. To capture the fine-grained relevance between the audio and the query, a cross-modal attention module generates snippet-specific query representations and automatically assigns higher weights to keywords carrying more important semantics. Furthermore, we develop a cross-gating module for the audio and the query to weaken irrelevant parts and emphasize important ones.
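To make the three components concrete, below is a minimal PyTorch sketch of the ideas the abstract describes: a one-step language graph over the query words, a cross-modal attention that yields a snippet-specific query representation, and a cross-gating step between the two modalities. All module names, the feature dimension `dim`, the scaled dot-product attention form, and the sigmoid gating form are illustrative assumptions; the paper's actual CGI architecture may differ in detail.

```python
# Illustrative sketch only: shapes and exact formulations are assumptions,
# not the paper's verified implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGraph(nn.Module):
    """One graph-convolution step over the query's word graph (e.g., with
    edges taken from a dependency parse), so each word representation
    absorbs information from its syntactic neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, words, adj):
        # words: (B, L, D); adj: (B, L, L) normalized adjacency with self-loops
        return F.relu(self.lin(torch.bmm(adj, words)))

class CrossModalAttention(nn.Module):
    """Builds a snippet-specific query representation: each audio snippet
    attends over the query words, so keywords that matter for that snippet
    receive higher attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)  # projects audio snippets
        self.proj_q = nn.Linear(dim, dim)  # projects query words

    def forward(self, audio, query):
        # audio: (B, T, D) snippet features; query: (B, L, D) word features
        scores = torch.bmm(self.proj_a(audio),
                           self.proj_q(query).transpose(1, 2))
        attn = F.softmax(scores / audio.size(-1) ** 0.5, dim=-1)  # (B, T, L)
        return torch.bmm(attn, query)  # (B, T, D): one query vector per snippet

class CrossGating(nn.Module):
    """Cross-gating: each modality produces a sigmoid gate over the other,
    suppressing query-irrelevant audio snippets and audio-irrelevant query
    content while emphasizing the relevant parts."""
    def __init__(self, dim):
        super().__init__()
        self.gate_a = nn.Linear(dim, dim)  # query -> gate over audio
        self.gate_q = nn.Linear(dim, dim)  # audio -> gate over query

    def forward(self, audio, snippet_query):
        # audio, snippet_query: (B, T, D), aligned per snippet
        gated_audio = audio * torch.sigmoid(self.gate_a(snippet_query))
        gated_query = snippet_query * torch.sigmoid(self.gate_q(audio))
        return gated_audio, gated_query
```

In a full model of this kind, the gated, snippet-aligned features would presumably feed a localization head that predicts per-snippet probabilities for the queried event's onsets and offsets.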