Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser (2305.17343v2)

Published 27 May 2023 in cs.CV, cs.SD, and eess.AS

Abstract: Audio-visual learning has been a major pillar of multi-modal machine learning, where the community mostly focused on its modality-aligned setting, i.e., the audio and visual modality are both assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell what events happen without knowing the modality they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is innovated to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV). Surprisingly, we found that modality-independent teachers outperform their modality-fused counterparts since they are noise-proof from the other potentially unaligned modality. Moreover, our best model achieves the new state-of-the-art on all metrics of LLP by a substantial margin (+5.4 F-score for Type@AV). VALOR is further generalized to Audio-Visual Event Localization and achieves the new state-of-the-art as well. Code is available at: https://github.com/Franklin905/VALOR.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Yung-Hsuan Lai (4 papers)
  2. Yen-Chun Chen (33 papers)
  3. Yu-Chiang Frank Wang (88 papers)
Citations (6)