Synchformer: Efficient Synchronization from Sparse Cues (2401.16423v1)
Published 29 Jan 2024 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS
Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model and a training scheme that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
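To make the "segment-level contrastive pre-training" idea concrete, below is a minimal sketch of an InfoNCE-style objective over temporally aligned audio and visual segment features. This is an illustration based on the abstract's description, not the paper's exact implementation: the function name `segment_contrastive_loss`, the feature dimensions, the temperature value, and the symmetric cross-entropy formulation are all assumptions.

```python
import torch
import torch.nn.functional as F

def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """InfoNCE-style loss over aligned audio/visual segment features.

    audio_feats, visual_feats: (N, dim) tensors, where the i-th audio segment
    and i-th visual segment come from the same temporal position of the same
    clip (positive pair); all other pairings in the batch act as negatives.
    (Hypothetical sketch; Synchformer's actual objective may differ.)
    """
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(visual_feats, dim=-1)
    logits = a @ v.t() / temperature            # (N, N) cosine-similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with made-up shapes: 2 clips x 4 segments, 256-dim segment features
# produced by separate audio and visual feature extractors.
audio = torch.randn(8, 256)
visual = torch.randn(8, 256)
loss = segment_contrastive_loss(audio, visual)
```

The point of such pre-training is that the feature extractors can be optimized once on the contrastive task and then reused, so the downstream synchronization module only has to reason over compact segment features rather than raw frames and waveforms.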