Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization (1807.00230v2)

Published 30 Jun 2018 in cs.CV

Abstract: There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.

Citations (462)

Summary

  • The paper presents a novel binary classification approach using contrastive and curriculum learning to align audio and video streams.
  • Experimental results reveal up to 19.9% accuracy gains on UCF101 and strong performance on benchmarks like ESC-50 without additional fine-tuning.
  • The method reduces dependency on labeled data and establishes a foundation for robust multimodal learning in video indexing and autonomous interpretation.

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

The paper presents a novel approach to leveraging the natural correlation between the visual and auditory elements of videos to enhance model training via self-supervised learning. The method, termed Audio-Visual Temporal Synchronization (AVTS), uses a two-stream network to determine whether an audio track and a video clip are temporally synchronized, a task hypothesized to yield robust multimodal feature representations.

Methodology

AVTS is designed as a binary classification task that assesses synchronization between paired audio and video samples, pushing the model to uncover temporal correlations beyond mere semantic associations. The network comprises two streams: an audio subnetwork with a VGG-like architecture and a video subnetwork based on 3D convolutions.
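
The following is a minimal sketch of such a two-stream setup in PyTorch; the layer counts, channel widths, and embedding size are illustrative placeholders, not the exact architectures used in the paper.

```python
# Minimal two-stream sketch: a VGG-like 2D CNN for audio spectrograms and a
# 3D CNN for short RGB clips, each mapping its input to a shared-size embedding.
import torch
import torch.nn as nn

class AudioSubnet(nn.Module):
    """VGG-style 2D CNN over log-mel spectrograms -> fixed-size embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(256, embed_dim)

    def forward(self, spec):              # spec: (B, 1, n_mels, time)
        return self.proj(self.features(spec).flatten(1))

class VideoSubnet(nn.Module):
    """3D CNN over short RGB clips -> fixed-size embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(256, embed_dim)

    def forward(self, clip):              # clip: (B, 3, T, H, W)
        return self.proj(self.features(clip).flatten(1))
```

The two embeddings are compared in a shared feature space, which is what allows a distance-based synchronization objective to be applied directly to the pair.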

Crucial to this approach is the use of a contrastive loss, which captures the synchronization objective by pulling temporally aligned audio-video pairs close together in feature space while pushing misaligned pairs apart. Curriculum learning further strengthens training: it begins with easier negatives and progresses to harder ones. Hard negatives consist of audio and video sampled from different temporal segments of the same video, forcing the model to rely on synchronization cues rather than semantic overlap.
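
A sketch of this training signal is shown below: a margin-based contrastive loss over the pair of embeddings, plus a curriculum negative sampler that switches from easy to hard negatives after some number of epochs. The margin value, the epoch threshold, the temporal offsets, and the `dataset.audio_clip` helper are all hypothetical, introduced only for illustration.

```python
# Sketch of the margin-based contrastive objective and curriculum negative
# sampling; constants and the dataset helper are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, audio_emb, is_positive, margin=1.0):
    """Pull in-sync pairs together, push out-of-sync pairs beyond the margin.
    is_positive is a float tensor of 1s (in-sync) and 0s (out-of-sync)."""
    dist = F.pairwise_distance(video_emb, audio_emb)            # (B,)
    pos_term = is_positive * dist.pow(2)
    neg_term = (1 - is_positive) * F.relu(margin - dist).pow(2)
    return (pos_term + neg_term).mean()

def sample_negative(dataset, video_index, clip_start, epoch, hard_after=10):
    """Easy negatives early (audio from a different video); hard negatives
    later (audio from a shifted segment of the same video).
    `dataset.audio_clip(index, start)` is a hypothetical helper returning the
    audio spectrogram for a given video and start time."""
    if epoch < hard_after:
        other = torch.randint(len(dataset), (1,)).item()
        return dataset.audio_clip(other, start=clip_start)
    shifted = clip_start + torch.randint(2, 6, (1,)).item()     # shift by a few seconds
    return dataset.audio_clip(video_index, start=shifted)
```

The curriculum matters because hard negatives share semantics with the positive clip, so the only signal distinguishing them is temporal alignment; introducing them too early can make the task too difficult to learn from scratch.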

Experimental Results

Without additional finetuning, the extracted audio features demonstrate competitive performance on benchmarks such as DCASE2014 and ESC-50, illustrating the efficacy of the self-supervised approach. The audio features reach accuracy on par with or exceeding the state of the art on these benchmarks, showing that the learned representations generalize across tasks.
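
This kind of evaluation amounts to a linear probe on frozen features. The sketch below, assuming the `AudioSubnet` from the earlier snippet and pre-computed spectrograms, shows one plausible version of that protocol; the paper's exact evaluation pipeline may differ.

```python
# Sketch of evaluating frozen audio features with a linear classifier,
# in the spirit of the DCASE2014 / ESC-50 protocol (illustrative, not the
# paper's exact pipeline).
import torch
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

@torch.no_grad()
def extract_features(audio_subnet, spectrograms, device="cpu"):
    """Run the frozen audio subnet over spectrograms shaped (N, 1, n_mels, time)."""
    audio_subnet.eval().to(device)
    x = torch.as_tensor(spectrograms, dtype=torch.float32, device=device)
    return audio_subnet(x).cpu().numpy()

def evaluate_linear_probe(train_X, train_y, test_X, test_y):
    """Fit a simple linear classifier on frozen features and report accuracy."""
    clf = LinearSVC(C=1.0)
    clf.fit(train_X, train_y)
    return accuracy_score(test_y, clf.predict(test_X))
```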

On video-based action recognition tasks, AVTS pretraining confers significant advantages: compared to models trained from scratch, it yields accuracy gains of +19.9% on UCF101 and +17.7% on HMDB51. Furthermore, when pretrained on larger datasets such as AudioSet, the models approach the performance of fully supervised counterparts, suggesting that the approach scales well with more unlabeled data.
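
In practice, this transfer amounts to loading the self-supervised visual subnet as a backbone and attaching a new classification head before fine-tuning on the target dataset. The sketch below assumes the `VideoSubnet` from the earlier snippet and a hypothetical checkpoint path; the head, optimizer settings, and class count are illustrative.

```python
# Sketch of reusing the self-supervised visual subnet as initialization for
# supervised action recognition (head, hyperparameters, and paths are assumptions).
import torch
import torch.nn as nn

def build_action_model(pretrained_video_subnet, num_classes=101, embed_dim=128):
    """Attach a classification head to the pretrained 3D backbone; all weights
    remain trainable so the full network can be fine-tuned on UCF101 / HMDB51."""
    return nn.Sequential(pretrained_video_subnet, nn.Linear(embed_dim, num_classes))

# Usage (hypothetical checkpoint path):
# video_subnet = VideoSubnet(embed_dim=128)
# video_subnet.load_state_dict(torch.load("avts_video_subnet.pt"))
# model = build_action_model(video_subnet, num_classes=101)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```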

Implications and Future Directions

The research offers substantial practical implications for tasks requiring unified understanding of video and audio content, relevant to areas such as video indexing, multimedia retrieval, and autonomous video interpretation. The ability to train models without manual annotations also reduces dependency on labeled datasets, which is pivotal given the growing volume of multimedia data.

The proposed AVTS framework provides a robust foundation for future advancements in self-supervised learning for multimodal data. Enhancements could focus on scaling up the training datasets or exploring more sophisticated network architectures and loss functions to further refine feature extraction. As self-supervised approaches continue to evolve, potential integration with other modalities, such as text, may also yield broader applications in AI.

The paper underscores a significant step forward in utilizing natural correlations in videos to facilitate effective learning of audio and visual models, reaffirming the vital role of synchronization in crafting versatile and accurate AI systems.