MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection (2310.02234v2)
Abstract: Deepfakes are synthetic media generated using deep generative algorithms and pose a severe societal and political threat. Beyond facial manipulation and synthetic voice, a new kind of deepfake has recently emerged in which either the audio or the visual modality is manipulated. In response, a new generation of multimodal audio-visual deepfake detectors is being investigated that jointly analyzes audio and visual data to detect multimodal manipulation. Existing multimodal (audio-visual) deepfake detectors are often based on fusing the audio and visual streams of the video. Existing studies suggest that these multimodal detectors often perform only on par with unimodal audio and visual deepfake detectors. We conjecture that the heterogeneous nature of the audio and visual signals creates distributional modality gaps and poses a significant challenge to effective fusion and efficient performance. In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection. Specifically, we propose the joint use of modality (audio and visual) invariant and specific representations. This ensures that both the common patterns and the patterns specific to each modality that represent pristine or fake content are preserved and fused for multimodal deepfake manipulation detection. Our experimental results on the FakeAVCeleb and KoDF audio-visual deepfake datasets suggest that the proposed method improves accuracy over SOTA unimodal and multimodal audio-visual deepfake detectors by 17.8% and 18.4%, respectively, thus obtaining state-of-the-art performance.
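The idea of fusing modality-invariant and modality-specific representations can be illustrated with a toy sketch. This is not the authors' architecture: the linear projections, the concatenation-based fusion, the two penalty functions, and every name below (`fuse`, `invariance_gap`, `overlap`, `FEAT_DIM`, `REP_DIM`) are assumptions made purely for illustration. It also assumes that the audio and visual features have already been embedded into vectors of the same size.

```python
import random

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng):
    """Hypothetical untrained projection with weights in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def fuse(audio_feat, visual_feat, shared_w, audio_w, visual_w):
    """Build invariant and specific parts and concatenate them (assumed fusion)."""
    # Invariant part: the SAME projection is applied to both modalities.
    inv_a = linear(shared_w, audio_feat)
    inv_v = linear(shared_w, visual_feat)
    # Specific part: each modality has its own private projection.
    spec_a = linear(audio_w, audio_feat)
    spec_v = linear(visual_w, visual_feat)
    fused = inv_a + inv_v + spec_a + spec_v  # list concatenation
    return fused, (inv_a, inv_v, spec_a, spec_v)

def invariance_gap(inv_a, inv_v):
    """Squared distance between the two invariant vectors (assumed penalty)."""
    return sum((a - v) ** 2 for a, v in zip(inv_a, inv_v))

def overlap(inv, spec):
    """Squared dot product; zero when the two vectors are orthogonal."""
    return sum(a * b for a, b in zip(inv, spec)) ** 2

rng = random.Random(0)
FEAT_DIM, REP_DIM = 8, 4  # illustrative sizes, not taken from the paper
shared_w = random_matrix(REP_DIM, FEAT_DIM, rng)
audio_w = random_matrix(REP_DIM, FEAT_DIM, rng)
visual_w = random_matrix(REP_DIM, FEAT_DIM, rng)
audio_feat = [rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]
visual_feat = [rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]

fused, (inv_a, inv_v, spec_a, spec_v) = fuse(
    audio_feat, visual_feat, shared_w, audio_w, visual_w
)
print(len(fused))                      # size of the fused vector fed to a classifier
print(invariance_gap(inv_a, inv_v))    # a training loss could push this down
print(overlap(inv_a, spec_a))          # a training loss could push this down
```

In a trained model of this kind, penalties like `invariance_gap` and `overlap` would plausibly be minimized alongside the real/fake classification loss, so that the shared projection captures what the modalities have in common while the private projections keep what is unique to each. The paper itself should be consulted for the losses and encoders actually used.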
Authors: Vinaya Sree Katamneni, Ajita Rattani