
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation (2304.02970v7)

Published 6 Apr 2023 in cs.CV and cs.MM

Abstract: Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset with high-quality pixel-level multi-class annotated images associated with audio files, and 2) a model that can establish strong links between audio information and its corresponding visual object. However, these requirements are only partially addressed by current methods, with training sets containing biased audio-visual data, and models that generalise poorly beyond this biased training set. In this work, we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning to leverage discriminative contrastive samples to enforce cross-modal understanding. We show empirical results that demonstrate the effectiveness of our benchmark. Furthermore, experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
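The abstract's core technical idea is supervised contrastive learning over paired audio and visual embeddings, with informative (hard) negative mining to sharpen cross-modal alignment. The sketch below illustrates one plausible form of that idea: embeddings from both modalities are pooled, same-class pairs act as positives, and only the top-k most confusing negatives per anchor enter the loss. The function name, the mining rule, and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def av_supcon_loss(audio_emb, visual_emb, labels, temperature=0.1, top_k=4):
    """Sketch of audio-visual supervised contrastive loss with hard-negative
    mining. Assumes audio_emb and visual_emb are (N, D) tensors whose i-th
    rows share the class label labels[i]. Illustrative only."""
    # Pool both modalities into one bank of 2N L2-normalised embeddings.
    z = F.normalize(torch.cat([audio_emb, visual_emb], dim=0), dim=1)
    y = torch.cat([labels, labels], dim=0)          # (2N,) class labels
    sim = z @ z.t() / temperature                   # (2N, 2N) scaled cosine sims
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~eye  # same-class, non-self pairs
    neg = ~pos & ~eye                                # different-class pairs

    loss = z.new_zeros(())
    count = 0
    for i in range(n):
        if pos[i].sum() == 0:
            continue  # anchor has no positive; skip it
        # "Informative sample mining": keep only the k highest-similarity
        # (i.e. most confusing) negatives for this anchor.
        neg_sims = sim[i][neg[i]]
        k = min(top_k, neg_sims.numel())
        hard_negs = neg_sims.topk(k).values
        # Standard InfoNCE-style term per positive, contrasted against the
        # mined hard negatives only.
        for p in sim[i][pos[i]]:
            logits = torch.cat([p.view(1), hard_negs])
            loss = loss - F.log_softmax(logits, dim=0)[0]
            count += 1
    return loss / max(count, 1)
```

In practice the audio embeddings would come from an audio backbone (e.g. a spectrogram encoder) and the visual embeddings from per-pixel or per-mask features, but any (N, D) tensors with shared labels exercise the loss.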

Authors (7)
  1. Yuanhong Chen
  2. Yuyuan Liu
  3. Hu Wang
  4. Fengbei Liu
  5. Chong Wang
  6. Helen Frazer
  7. Gustavo Carneiro
Citations (7)
