
iQuery: Instruments as Queries for Audio-Visual Sound Separation (2212.03814v2)

Published 7 Dec 2022 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Current audio-visual separation methods share a standard architecture design where an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We reformulate the visual-sound separation task and propose Instrument as Query (iQuery) with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We utilize "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference in the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from the text-prompt design, we insert an additional query as an audio prompt while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that iQuery improves audio-visual sound source separation performance.
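
The query expansion idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single-head cross-attention with no learned projections and no self-attention among queries, and the names `cross_attend`, `expand_queries`, `feats`, and `queries` are hypothetical. Under those assumptions, each instrument query reads the shared audio features independently, so appending a query for a new class leaves the outputs of the existing queries unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, feats):
    """Toy single-head cross-attention.

    Each query attends over the feature vectors; keys and values are the
    features themselves (an assumption made to keep the sketch small).
    """
    d = len(feats[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every feature.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in feats]
        weights = softmax(scores)
        # Attention output: weighted average of the feature vectors.
        out = [sum(w * f[j] for w, f in zip(weights, feats)) for j in range(d)]
        outputs.append(out)
    return outputs


def expand_queries(queries, new_query):
    """Return a copy of the query set with one extra query appended.

    The existing queries are copied unchanged, mimicking a frozen model in
    which only the new query would be trained.
    """
    return [list(q) for q in queries] + [list(new_query)]


# Hypothetical toy data: three audio feature vectors, two instrument queries.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[2.0, 0.0], [0.0, 2.0]]

before = cross_attend(queries, feats)
expanded = expand_queries(queries, [1.0, 1.0])  # query for a new class
after = cross_attend(expanded, feats)
```

In this sketch `after` has one more row than `before`, and its first two rows match `before`. A full transformer decoder that also applies self-attention across queries would not preserve the existing outputs exactly, which is one reason the sketch should be read as an illustration only.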

Authors (6)
  1. Jiaben Chen (12 papers)
  2. Renrui Zhang (100 papers)
  3. Dongze Lian (19 papers)
  4. Jiaqi Yang (107 papers)
  5. Ziyao Zeng (12 papers)
  6. Jianbo Shi (57 papers)
Citations (22)
