Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation (2307.13236v1)

Published 25 Jul 2023 in cs.SD, cs.CV, cs.LG, cs.MM, and eess.AS

Abstract: The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in video frames using audio cues. However, current fusion-based methods have performance limitations due to the small receptive field of convolution and inadequate fusion of audio-visual features. To overcome these issues, we propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task. Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on segmenting the pinpointed sounding objects based on audio signals, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and demonstrates better generalization ability in multi-sound and open-set scenarios.

Authors (6)
  1. Jinxiang Liu (9 papers)
  2. Chen Ju (26 papers)
  3. Chaofan Ma (17 papers)
  4. Yanfeng Wang (211 papers)
  5. Yu Wang (939 papers)
  6. Ya Zhang (222 papers)
Citations (14)
