Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization (2403.03145v1)
Abstract: Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, the naive semi-supervised method makes poor use of the abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, filter out noisy samples via the consensus between their predictions and then generate high-quality pseudo-labels by intersecting their confidence maps. By making full use of both labeled and unlabeled data within this unbiased framework, DMT outperforms current state-of-the-art methods by a large margin, reaching CIoU of 90.4% on Flickr-SoundNet and 48.8% on VGG-Sound Source given only 3% positional annotations. This amounts to improvements of 8.9% and 9.6% over self-supervised methods and 4.6% and 6.4% over semi-supervised methods, respectively. We also extend our framework to some existing AVSL methods and consistently boost their performance.
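The consensus-and-intersection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the agreement measure (IoU between binarized maps), the thresholds `thr` and `min_iou`, and all function names are assumptions chosen for illustration only.

```python
from typing import List, Optional

Map = List[List[float]]   # per-pixel confidence in [0, 1]
Mask = List[List[int]]    # binary localization mask


def binarize(conf: Map, thr: float) -> Mask:
    """Turn a confidence map into a binary mask (threshold is an assumption)."""
    return [[1 if v >= thr else 0 for v in row] for row in conf]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def consensus_pseudo_label(conf_a: Map, conf_b: Map,
                           thr: float = 0.5,
                           min_iou: float = 0.5) -> Optional[Mask]:
    """Return the intersected mask if the two teachers agree, else None.

    Agreement is measured here by IoU, which is an illustrative choice;
    a sample whose teachers disagree is treated as noisy and dropped.
    """
    mask_a = binarize(conf_a, thr)
    mask_b = binarize(conf_b, thr)
    if iou(mask_a, mask_b) < min_iou:
        return None  # no consensus: filter the sample out
    # Keep only pixels that both teachers mark as sounding.
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]


# Toy 3x3 confidence maps from two hypothetical teachers.
teacher_a = [[0.9, 0.8, 0.1], [0.7, 0.6, 0.2], [0.1, 0.1, 0.1]]
teacher_b = [[0.8, 0.9, 0.6], [0.6, 0.2, 0.1], [0.1, 0.1, 0.1]]
teacher_c = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1], [0.1, 0.9, 0.9]]

agreed = consensus_pseudo_label(teacher_a, teacher_b)    # teachers overlap
rejected = consensus_pseudo_label(teacher_a, teacher_c)  # teachers disagree
```

In this toy example the first pair of maps overlaps enough to yield a pseudo-label containing only the pixels both teachers selected, while the second pair is rejected. Intersecting rather than averaging the maps favors precision, which matches the abstract's emphasis on reducing false positives.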
Authors:
- Yuxin Guo
- Shijie Ma
- Hu Su
- Zhiqing Wang
- Yuhao Zhao
- Wei Zou
- Siyang Sun
- Yun Zheng