Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization (2210.05242v2)
Abstract: Audio-visual event (AVE) localization has attracted much attention in recent years. Most existing methods independently encode and classify each video segment in isolation from the full video (i.e., they rely on segment-level representations of events), and thus ignore the semantic consistency of an event within the same full video (i.e., the video-level representation of the event). In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task. Specifically, we propose an event semantic consistency modeling (ESCM) module that explores video-level semantic information for semantic consistency modeling. It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE). CERE obtains event semantic information at the video level, and ISCE takes this video-level event semantics as prior knowledge to guide the model to focus on the semantic continuity of an event within each modality. Moreover, we propose a new negative-pair filter loss that encourages the network to filter out irrelevant segment pairs, and a new smooth loss that further increases the gap between different event categories in the weakly supervised setting. Extensive experiments on the public AVE dataset show that our method outperforms the state-of-the-art in both fully and weakly supervised settings, verifying its effectiveness. The code is available at https://github.com/Bravo5542/VSCG.
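The abstract does not spell out how the ESCM module is implemented; the authors' code is at the GitHub link above. The following is only a minimal PyTorch sketch of the general idea, assuming attention pooling for CERE, a gated residual for ISCE, and a feature dimension of 256; all module designs, names, and shapes here are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only: the real implementation lives at
# https://github.com/Bravo5542/VSCG. All design choices below are assumptions.
import torch
import torch.nn as nn


class CERE(nn.Module):
    """Cross-modal Event Representation Extractor (assumed design):
    attention-pools audio and visual segment features into a single
    video-level event embedding."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 1)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (B, T, D) segment features
        fused = torch.cat([audio, visual], dim=-1)        # (B, T, 2D)
        weights = torch.softmax(self.attn(fused), dim=1)  # (B, T, 1) over segments
        event = (weights * fused).sum(dim=1)              # (B, 2D) video-level pooling
        return self.proj(event)                           # (B, D) event embedding


class ISCE(nn.Module):
    """Intra-modal Semantic Consistency Enhancer (assumed design):
    uses the video-level event embedding as a prior to gate each
    segment feature within one modality."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, segments: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
        # segments: (B, T, D); event: (B, D)
        prior = event.unsqueeze(1).expand_as(segments)    # broadcast prior over time
        g = torch.sigmoid(self.gate(torch.cat([segments, prior], dim=-1)))
        return segments + g * prior                       # consistency-guided features


class ESCM(nn.Module):
    """Event Semantic Consistency Modeling: CERE followed by per-modality ISCE."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.cere = CERE(dim)
        self.isce_audio = ISCE(dim)
        self.isce_visual = ISCE(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        event = self.cere(audio, visual)
        return self.isce_audio(audio, event), self.isce_visual(visual, event)


if __name__ == "__main__":
    # Two videos, ten one-second segments each, 256-d features (assumed shapes).
    audio = torch.randn(2, 10, 256)
    visual = torch.randn(2, 10, 256)
    a_out, v_out = ESCM()(audio, visual)
    print(a_out.shape, v_out.shape)  # torch.Size([2, 10, 256]) for each modality
```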
Authors: Yuanyuan Jiang, Jianqin Yin, Yonghao Dang