SAVE: Segment Audio-Visual Easy way using Segment Anything Model (2407.02004v2)

Published 2 Jul 2024 in cs.CV, cs.AI, cs.SD, and eess.AS

Abstract: The primary aim of Audio-Visual Segmentation (AVS) is to identify and locate auditory elements within visual scenes by predicting segmentation masks at the pixel level. Achieving this requires jointly considering both the data and the model design. This study presents SAVE, a lightweight approach that efficiently adapts the pre-trained Segment Anything Model (SAM) to the AVS task. By incorporating an image encoder adapter into the transformer blocks to better capture dataset-specific information, and by proposing a residual audio encoder adapter that encodes audio features as a sparse prompt, the proposed model achieves effective audio-visual fusion and interaction during the encoding stage. The method accelerates training and inference by reducing the input resolution from 1024 to 256 pixels while surpassing the previous state of the art. Extensive experiments validate this approach, showing that the model significantly outperforms other state-of-the-art methods. Moreover, pre-training on synthetic data enhances performance on real AVSBench data, achieving 84.59 mIoU on the S4 (V1S) subset and 70.28 mIoU on the MS3 (V1M) set with only 256-pixel input images. Performance rises to 86.16 mIoU on S4 (V1S) and 70.83 mIoU on MS3 (V1M) with 1024-pixel inputs.
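The abstract describes inserting lightweight adapters into a frozen pre-trained encoder, a common parameter-efficient fine-tuning pattern. The paper's exact adapter design is not given here; the sketch below is a hypothetical bottleneck adapter with a residual connection, written in NumPy with illustrative dimensions (`dim=256`, `bottleneck=32`) that are assumptions, not values from the paper.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, the activation typically used in ViT blocks
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class BottleneckAdapter:
    """Residual bottleneck adapter: x + up(gelu(down(x))).

    Hypothetical sketch of the kind of adapter inserted into frozen
    transformer blocks; not the paper's exact architecture.
    """

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02
        # Zero-initializing the up-projection makes the adapter start as an
        # identity map, so adding it does not perturb the pre-trained model.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        return x + gelu(x @ self.w_down) @ self.w_up

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 256))            # 16 tokens, embedding dim 256
adapter = BottleneckAdapter(dim=256, bottleneck=32, rng=rng)
out = adapter(tokens)
print(out.shape)                                    # (16, 256)
print(np.allclose(out, tokens))                     # True: identity at init
```

Only the small down/up projections would be trained, which is what keeps such an adaptation of SAM lightweight relative to full fine-tuning.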

