
Improving Audio-Visual Segmentation with Bidirectional Generation (2308.08288v2)

Published 16 Aug 2023 in cs.CV

Abstract: The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level in the AVS benchmark, particularly excelling in the challenging MS3 subset which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.
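The visual-to-audio projection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything here is an assumption made for illustration: masked average pooling of visual features under the predicted mask, a single linear projection into the audio feature space, and a mean-squared reconstruction error. The function names (`masked_average_pool`, `project`, `reconstruction_loss`) are hypothetical and are not taken from the paper or its code release.

```python
# Illustrative sketch only: pooling, linear projection, and MSE are assumed,
# not confirmed by the paper. Pure Python, no external dependencies.

def masked_average_pool(features, mask):
    """Average the C-dimensional pixel features, weighted by a soft mask.

    features: H x W x C nested lists; mask: H x W values in [0, 1].
    Returns a list of length C (all zeros if the mask is empty).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    weight = 0.0
    for feature_row, mask_row in zip(features, mask):
        for pixel, m in zip(feature_row, mask_row):
            weight += m
            for c in range(channels):
                total[c] += m * pixel[c]
    if weight == 0.0:
        return total
    return [t / weight for t in total]


def project(vec, weights, bias):
    """Linear map from the visual feature space to the audio feature space."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def mse(pred, target):
    """Mean-squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def reconstruction_loss(features, mask, weights, bias, audio_feature):
    """Error between the audio feature reconstructed from the masked
    visual region and the observed audio feature."""
    pooled = masked_average_pool(features, mask)
    return mse(project(pooled, weights, bias), audio_feature)


# Toy 2x2 frame with 2-channel features; the top row is the sounding object.
features = [[[1, 0], [3, 0]],
            [[0, 5], [0, 7]]]
identity = [[1, 0], [0, 1]]
audio = [2, 0]

good_mask = [[1, 1], [0, 0]]   # covers the sounding object
bad_mask = [[0, 0], [1, 1]]    # covers the silent region

loss_good = reconstruction_loss(features, good_mask, identity, [0, 0], audio)
loss_bad = reconstruction_loss(features, bad_mask, identity, [0, 0], audio)
```

In this toy setup the mask that covers the sounding region yields a lower reconstruction error than the mask that covers the silent region, which is the kind of signal a reconstruction objective could feed back to the segmentation branch.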

