Visual and audio scene classification for detecting discrepancies in video: a baseline method and experimental protocol
Abstract: This paper presents a baseline approach and an experimental protocol for a specific content verification problem: detecting discrepancies between the audio and video modalities in multimedia content. We first design and optimize an audio-visual scene classifier and compare it with existing classification baselines that use both modalities. Then, by applying this classifier separately to the audio and the visual modality, we detect scene-class inconsistencies between them. To facilitate further research and provide a common evaluation platform, we introduce an experimental protocol and a benchmark dataset simulating such inconsistencies. Our approach achieves state-of-the-art results in scene classification and promising results in audio-visual discrepancy detection, highlighting its potential in content verification applications.
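The discrepancy check described in the abstract can be illustrated with a minimal sketch. It assumes the classifier has already produced one vector of scene-class probabilities per modality. The scene labels, the function name `detect_discrepancy`, and the simple "top classes differ" decision rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical scene labels; the actual class set depends on the dataset used.
SCENES = ["airport", "bus", "park", "street_traffic"]


def top_class(probs):
    """Return the index of the highest-scoring class."""
    return max(range(len(probs)), key=lambda i: probs[i])


def detect_discrepancy(audio_probs, visual_probs, labels=SCENES):
    """Flag a clip whose audio and visual scene predictions disagree.

    Assumed decision rule: a discrepancy is reported when the top-scoring
    class of the audio-only prediction differs from the top-scoring class
    of the visual-only prediction.
    """
    if not (len(audio_probs) == len(visual_probs) == len(labels)):
        raise ValueError("probability vectors and labels must have equal length")
    a = top_class(audio_probs)
    v = top_class(visual_probs)
    return {
        "audio_scene": labels[a],
        "visual_scene": labels[v],
        "discrepancy": a != v,
    }


if __name__ == "__main__":
    # Audio sounds like a park while the frames look like street traffic.
    print(detect_discrepancy([0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]))
```

A practical system would likely add a confidence threshold so that low-certainty predictions do not trigger a flag; that refinement is left out here to keep the rule explicit.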