Multimodal Transformer Distillation for Audio-Visual Synchronization (2210.15563v3)
Abstract: Audio-visual synchronization aims to determine whether the mouth movements and speech in a video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interaction information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposes MTDVocaLiST, a model trained with our proposed multimodal Transformer distillation (MTD) loss. The MTD loss enables MTDVocaLiST to deeply mimic the cross-attention distribution and value-relation in the Transformer of VocaLiST. Additionally, we harness uncertainty weighting to fully exploit the interaction information across all layers. Our proposed method is effective in two respects. From the distillation-method perspective, the MTD loss outperforms other strong distillation baselines. From the distilled model's performance perspective: 1) MTDVocaLiST outperforms the similar-size SOTA models SyncNet and Perfect Match by 15.65% and 3.35%, respectively; 2) MTDVocaLiST reduces the model size of VocaLiST by 83.52% while maintaining similar performance.
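To make the training objective concrete, below is a minimal, hypothetical PyTorch sketch of an MTD-style loss, not the authors' released implementation. It assumes the teacher (VocaLiST) and student (MTDVocaLiST) expose, for each Transformer layer, the softmaxed cross-attention maps and the value vectors; the value-relation follows the MiniLM formulation (softmax of the scaled V·Vᵀ similarity), and the per-layer KL terms are combined with the learned uncertainty weighting of Kendall et al. All names here (`MTDLoss`, `kl_attention`, `value_relation`) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def kl_attention(teacher_dist, student_dist, eps=1e-8):
    """KL divergence between teacher and student attention distributions.

    Both tensors have shape (batch, heads, queries, keys) and each row
    along the last dimension is already a softmaxed distribution.
    """
    return F.kl_div((student_dist + eps).log(), teacher_dist,
                    reduction="batchmean")


def value_relation(v):
    """MiniLM-style value-relation: softmax(V V^T / sqrt(d)).

    v has shape (batch, heads, length, head_dim).
    """
    d = v.size(-1)
    return F.softmax(v @ v.transpose(-1, -2) / d ** 0.5, dim=-1)


class MTDLoss(nn.Module):
    """Hypothetical sketch of an MTD-style loss: per-layer cross-attention
    and value-relation KL terms, combined with learned uncertainty weights
    (Kendall et al., 2017): L = sum_i exp(-s_i) * L_i + s_i, s_i = log var_i.
    """

    def __init__(self, num_layers):
        super().__init__()
        # One log-variance per loss term: attention + value-relation per layer.
        self.log_vars = nn.Parameter(torch.zeros(2 * num_layers))

    def forward(self, teacher_layers, student_layers):
        # Each element is assumed to be a dict with "attn" (softmaxed
        # cross-attention maps) and "value" (value vectors) from one layer.
        per_term = []
        for t, s in zip(teacher_layers, student_layers):
            per_term.append(kl_attention(t["attn"].detach(), s["attn"]))
            per_term.append(kl_attention(value_relation(t["value"]).detach(),
                                         value_relation(s["value"])))
        total = torch.zeros((), device=self.log_vars.device)
        for loss_i, s_i in zip(per_term, self.log_vars):
            # Uncertainty weighting: noisier terms get larger s_i, hence
            # smaller effective weight exp(-s_i), learned end to end.
            total = total + torch.exp(-s_i) * loss_i + s_i
        return total
```

Under this weighting scheme, layers whose distillation signal is noisier are automatically down-weighted, so all teacher layers can contribute to the student without per-layer manual tuning, consistent with the abstract's claim of exploiting interaction information across all layers.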
- “Text-dependent audiovisual synchrony detection for spoofing detection in mobile person recognition,” in Interspeech, 2016, vol. 2, p. 4.
- “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 484–492.
- “Perfect match: Improved cross-modal embeddings for audio-visual synchronisation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3965–3969.
- “Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3927–3935.
- “Push-pull: Characterizing the adversarial robustness for audio-visual active speaker detection,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 692–699.
- “Self-supervised learning of audio-visual objects from video,” in European Conference on Computer Vision. Springer, 2020, pp. 208–224.
- “Audio-visual scene analysis with self-supervised multisensory features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 631–648.
- “Visually guided sound source separation and localization using self-supervised motion representations,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1289–1299.
- “Selective listening by synchronizing speech with lips,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1650–1664, 2022.
- “Out of time: Automated lip sync in the wild,” in Asian Conference on Computer Vision. Springer, 2016, pp. 251–263.
- “Perfect match: Self-supervised embeddings for cross-modal retrieval,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 3, pp. 568–576, 2020.
- “Audio-visual synchronisation in the wild,” arXiv preprint arXiv:2112.04432, 2021.
- “VocaLiST: An audio-visual synchronisation model for lips and voices,” arXiv preprint arXiv:2204.02090, 2022.
- “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- “FitNets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
- “Relational knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967–3976.
- “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” arXiv preprint arXiv:1612.03928, 2016.
- “Similarity-preserving knowledge distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1365–1374.
- “Correlation congruence for knowledge distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5007–5016.
- “Variational information distillation for knowledge transfer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9163–9171.
- “Learning deep representations with probabilistic knowledge transfer,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 268–284.
- “Knowledge transfer via distillation of activation boundaries formed by hidden neurons,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 3779–3787.
- “Paraphrasing complex network: Network compression via factor transfer,” Advances in Neural Information Processing Systems, vol. 31, 2018.
- “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4133–4141.
- “Like what you like: Knowledge distill via neuron selectivity transfer,” arXiv preprint arXiv:1707.01219, 2017.
- “Contrastive representation distillation,” in International Conference on Learning Representations, 2020.
- “Wasserstein contrastive representation distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16296–16305.
- “Adversarial speaker distillation for countermeasure model on automatic speaker verification,” in Proc. 2nd Symposium on Security and Privacy in Speech Communication, 2022, pp. 30–34.
- “MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers,” Advances in Neural Information Processing Systems, vol. 33, pp. 5776–5788, 2020.
- “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” arXiv preprint arXiv:1705.07115, 2017.
- “Auxiliary tasks in multi-task learning,” arXiv preprint arXiv:1805.06334, 2018.
- “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.