Synchformer: Efficient Synchronization from Sparse Cues (2401.16423v1)

Published 29 Jan 2024 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.

References (30)
  1. “Perfect match: Improved cross-modal embeddings for audio-visual synchronisation,” in ICASSP, 2019.
  2. J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” in Workshop on Multi-view Lip-reading, ACCV, 2016.
  3. “Self-supervised learning of audio-visual objects from video,” in ECCV, 2020.
  4. R. Arandjelovic and A. Zisserman, “Objects that sound,” in ECCV, 2018.
  5. A. Owens and A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in ECCV, 2018.
  6. “Audio-visual synchronisation in the wild,” in BMVC, 2021.
  7. “Sparse in space and time: Audio-visual synchronisation with trainable selectors,” in BMVC, 2022.
  8. “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017.
  9. J. Hershey and J. Movellan, “Audio vision: Using audio-visual synchrony to locate sounds,” NeurIPS, 1999.
  10. M. Slaney and M. Covell, “Facesync: A linear operator for measuring synchronization of video facial images and audio tracks,” NeurIPS, 2000.
  11. J. S. Chung and A. Zisserman, “Lip reading in the wild,” in ACCV, 2016.
  12. L. Rabiner and B.-H. Juang, Fundamentals of speech recognition, Prentice-Hall, Inc., 1993.
  13. “Dynamic temporal alignment of speech to lips,” in ICASSP, 2019.
  14. “On attention modules for audio-visual synchronization,” in Workshop on Sight and Sound, CVPR, 2019.
  15. “End-to-end lip synchronisation based on pattern classification,” in SLT Workshop, 2021.
  16. “Vocalist: An audio-visual synchronisation model for lips and voices,” in Interspeech, 2022.
  17. “ModEFormer: Modality-preserving embedding for audio-video synchronization using transformers,” in ICASSP, 2023.
  18. “VGG-Sound: A large-scale audio-visual dataset,” in ICASSP, 2020.
  19. “AST: Audio Spectrogram Transformer,” in Interspeech, 2021.
  20. “Keeping your eye on the ball: Trajectory attention in video transformers,” in NeurIPS, 2021.
  21. “Is space-time attention all you need for video understanding?,” in ICML, 2021.
  22. “Attention is all you need,” in NeurIPS, 2017.
  23. “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL: HLT, 2019.
  24. “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  25. “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  26. “LRS3-TED: a large-scale dataset for visual speech recognition,” arXiv preprint arXiv:1809.00496, 2018.
  27. “Deep residual learning for image recognition,” in CVPR, 2016.
  28. “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,” in ECCV, 2018.
  29. “End-to-end object detection with transformers,” in ECCV, 2020.
  30. “The "something something" video database for learning and evaluating visual common sense,” in ICCV, 2017.

Summary

  • The paper introduces a two-stage training paradigm that separates feature extraction from synchronization modeling to effectively align audio-visual streams.
  • It employs segment-level contrastive pre-training and transformer encoders to efficiently process transient synchronization cues in challenging, open-domain videos.
  • The model also incorporates evidence attribution and synchronizability prediction, improving interpretability and extending the scope of the synchronization task.

Introduction

The paper presents Synchformer, a new model for audio-visual synchronization that advances the alignment of audio and visual streams, particularly in challenging 'in-the-wild' settings such as YouTube videos, where synchronization cues can be sporadic. In such videos, detecting these sparse cues is essential for estimating the temporal offset, i.e., the time difference between corresponding events in the audio and visual channels. Synchformer's training decouples feature extraction from synchronization modeling through multi-modal, segment-level contrastive pre-training, yielding strong performance in both densely and sparsely cued settings.

Related Work

Earlier synchronization models are rooted in densely cued scenarios such as talking heads, where informative signals are abundant. Open-domain videos differ: cues are transient and scattered, so models must consider extended temporal windows to capture them. Prior approaches have struggled with a reliance on densely cued data for pre-training and with end-to-end training, both of which incur high computational cost and constrain the choice of feature extractors. These limitations of earlier designs, such as those encountered by Iashin et al., motivated Synchformer, which adopts a two-stage training paradigm and scales training to the million-scale AudioSet dataset. Synchformer also explores interpretability through evidence attribution techniques and examines a new capability: judging audio-visual synchronizability.

Methodology

Rather than processing an entire clip at once, Synchformer divides the video into short temporal segments and extracts audio and visual features from each segment separately. A transformer encoder then aggregates these segment features, and a synchronization module predicts the temporal offset. Operating on shorter input sequences makes training less memory-intensive. The segments may overlap, which improves synchronization precision. Training proceeds in two stages: the feature extractors are first trained with segment-level contrastive pre-training, after which they are frozen and the synchronization module is trained on their outputs.
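
The paper's code is not reproduced here, so the sketch below only illustrates the two-stage recipe described above, in PyTorch. All concrete choices (embedding size `D`, the 21-way offset grid, the encoder and module names) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256          # assumed embedding size
N_OFFSETS = 21   # assumed discrete offset grid, e.g. -2 s ... +2 s in 0.2 s steps


class SegmentContrastivePretraining(nn.Module):
    """Stage 1 (sketch): align per-segment audio and visual embeddings with InfoNCE."""

    def __init__(self, audio_enc: nn.Module, visual_enc: nn.Module, temp: float = 0.07):
        super().__init__()
        self.audio_enc, self.visual_enc = audio_enc, visual_enc
        self.log_temp = nn.Parameter(torch.tensor(temp).log())

    def forward(self, audio_segs: torch.Tensor, visual_segs: torch.Tensor) -> torch.Tensor:
        # Each (hypothetical) encoder maps a batch of raw segments to one D-dim vector
        # per segment, i.e. (B * S, D); matching audio/visual segments are the positives.
        a = F.normalize(self.audio_enc(audio_segs), dim=-1)
        v = F.normalize(self.visual_enc(visual_segs), dim=-1)
        logits = a @ v.t() / self.log_temp.exp()
        target = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))


class SyncModule(nn.Module):
    """Stage 2 (sketch): a transformer over frozen segment features classifies the offset."""

    def __init__(self, n_layers: int = 3):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, D))
        self.modality = nn.Embedding(2, D)   # distinguishes audio tokens from visual tokens
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(D, N_OFFSETS)  # classification over discrete offsets

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats, visual_feats: (B, S, D) segment features from the frozen extractors.
        B = audio_feats.size(0)
        a = audio_feats + self.modality.weight[0]
        v = visual_feats + self.modality.weight[1]
        tokens = torch.cat([self.cls.expand(B, -1, -1), a, v], dim=1)
        return self.head(self.encoder(tokens)[:, 0])   # logits over the offset classes
```

Because the extractors are frozen in the second stage, segment features can be pre-computed once, so only the comparatively small synchronization transformer needs to be trained on them.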

Additional Capabilities and Conclusion

Synchformer explores two additional capabilities: evidence attribution and synchronizability prediction. Evidence attribution highlights the temporal portions of the audio and visual streams that the model relies on for its predictions. Synchronizability prediction, a new task introduced with Synchformer, assesses whether the audio and visual streams can be synchronized at all. With this design and these added capabilities, Synchformer delivers a substantial improvement in synchronization performance: it handles sparse synchronization cues efficiently, offers a route to understanding model predictions, and extends synchronization research to new problems such as synchronizability assessment.
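
As a rough illustration of the synchronizability idea (the head below and its two-class output are assumptions, not the paper's exact formulation), the same aggregated feature used for offset classification could feed a small binary head:

```python
import torch
import torch.nn as nn


class SynchronizabilityHead(nn.Module):
    """Hypothetical add-on: decide whether the two streams can be synchronized at all."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.head = nn.Linear(d, 2)  # classes: {not synchronizable, synchronizable}

    def forward(self, cls_feature: torch.Tensor) -> torch.Tensor:
        # cls_feature: (B, d) pooled output of the synchronization transformer
        return self.head(cls_feature)
```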