Sufficiency of visual region selection for reliable audio–visual separation
Determine whether selecting a visual region alone (for example, providing a binary mask over video frames using tools such as SAM 2) is sufficient to reliably separate the corresponding audio source in real-world videos that include both on-screen and off-screen sources and exhibit ambiguous audio–visual correspondences.
References
Real-world videos contain a mix of on-screen and off-screen sources and ambiguous sound–vision correspondences, making it unclear whether simply selecting a visual region is sufficient for reliable separation.
— SAM Audio: Segment Anything in Audio
(2512.18099 - Shi et al., 19 Dec 2025) in Section 1 (Introduction)