AVSegFormer: Audio-Visual Segmentation with Transformer (2307.01146v4)
Abstract: The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to the visual features of interest. In addition, we present an audio-visual mixer, which can dynamically adjust visual features by amplifying relevant and suppressing irrelevant spatial channels. Finally, we devise an intermediate mask loss to strengthen the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
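To make the channel re-weighting idea concrete, the snippet below is a minimal PyTorch sketch of an audio-conditioned channel gate in the spirit of the audio-visual mixer described above. The module name, feature dimensions, and layer choices are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class AudioVisualMixer(nn.Module):
    """Illustrative channel-attention mixer: the audio embedding produces one
    gate per visual channel, amplifying channels relevant to the sounding
    object and suppressing the rest. (Hypothetical sketch, not the authors'
    exact design.)"""

    def __init__(self, visual_dim: int = 256, audio_dim: int = 128):
        super().__init__()
        # Project the audio embedding to one gate value per visual channel.
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, visual_dim),
            nn.Sigmoid(),
        )

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W) visual feature map from the image backbone
        # audio_feat:  (B, D) audio embedding (e.g., from a VGGish-style encoder)
        weights = self.gate(audio_feat)                 # (B, C) channel weights in [0, 1]
        weights = weights.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) for broadcasting
        return visual_feat * weights                    # re-weighted visual features


if __name__ == "__main__":
    mixer = AudioVisualMixer()
    v = torch.randn(2, 256, 56, 56)   # dummy visual features
    a = torch.randn(2, 128)           # dummy audio embedding
    print(mixer(v, a).shape)          # torch.Size([2, 256, 56, 56])
```

The same gating pattern can be conditioned on any modality embedding; here it simply expresses "amplify relevant, suppress irrelevant channels" as an element-wise multiplication.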