Sufficiency of visual region selection for reliable audio–visual separation

Determine whether selecting a visual region alone (for example, a binary mask over video frames produced with a tool such as SAM 2) is sufficient to reliably separate the corresponding audio source in real-world videos. Such videos include both on-screen and off-screen sources and exhibit ambiguous audio–visual correspondences.

Background

The paper notes that prior visual-prompted separation systems are typically evaluated on small or synthetic datasets and that real-world videos often contain off-screen sounds and ambiguous audio–visual correspondences. In such scenarios, it is uncertain whether simple visual localization (e.g., selecting a region) adequately specifies the target audio for faithful separation.

This uncertainty motivates SAM Audio’s multimodal prompting design, which combines text, visual masks, and temporal spans. The paper nonetheless explicitly flags as unclear whether visual selection alone suffices under realistic conditions, which leaves this a concrete open question in audio–visual separation.

References

Real-world videos contain a mix of on-screen and off-screen sources and ambiguous sound–vision correspondences, making it unclear whether simply selecting a visual region is sufficient for reliable separation.

SAM Audio: Segment Anything in Audio (2512.18099 - Shi et al., 19 Dec 2025) in Section 1 (Introduction)