
Visual and audio scene classification for detecting discrepancies in video: a baseline method and experimental protocol

Published 1 May 2024 in cs.CV, cs.MM, cs.SD, and eess.AS | (2405.00384v1)

Abstract: This paper presents a baseline approach and an experimental protocol for a specific content verification problem: detecting discrepancies between the audio and video modalities in multimedia content. We first design and optimize an audio-visual scene classifier and compare it with existing classification baselines that use both modalities. Then, by applying this classifier separately to the audio and the visual modality, we detect scene-class inconsistencies between them. To facilitate further research and provide a common evaluation platform, we introduce an experimental protocol and a benchmark dataset simulating such inconsistencies. Our approach achieves state-of-the-art results in scene classification and promising results in audio-visual discrepancy detection, highlighting its potential in content verification applications.
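
The idea of classifying each modality separately and comparing the predicted scene classes can be illustrated with a minimal sketch. This is not the authors' implementation: the scene labels, the function names `predict_scene` and `is_discrepant`, and the simple "top classes differ" decision rule are all assumptions made for illustration, and the per-modality probabilities are assumed to come from some already-trained classifier.

```python
from typing import Sequence

# Hypothetical scene labels, chosen only for illustration.
SCENES = ["airport", "park", "metro"]


def predict_scene(probs: Sequence[float]) -> str:
    """Return the label whose probability is highest."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return SCENES[best]


def is_discrepant(audio_probs: Sequence[float],
                  visual_probs: Sequence[float]) -> bool:
    """Flag a clip when the audio and visual top classes disagree.

    Assumed decision rule: the paper's abstract does not specify
    how the two per-modality predictions are compared.
    """
    return predict_scene(audio_probs) != predict_scene(visual_probs)


# Audio sounds like a park, video looks like an airport: flagged.
mismatch = is_discrepant([0.1, 0.8, 0.1], [0.7, 0.2, 0.1])
# Both modalities agree on "park": not flagged.
match = is_discrepant([0.1, 0.8, 0.1], [0.2, 0.7, 0.1])
```

A practical system would likely need a softer rule than exact top-class agreement, since acoustically or visually similar scenes can legitimately receive different labels.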
