
SAMDWICH: Moment-aware Video-text Alignment for Referring Video Object Segmentation (2508.11955v1)

Published 16 Aug 2025 in cs.CV

Abstract: Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. However, existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training, regardless of their actual relevance to the expression. To address this, we introduce a moment-aware RVOS framework named SAMDWICH, along with a newly annotated dataset, MeViS-M, built upon the challenging MeViS benchmark. We manually annotate temporal moments indicating when each object is referred to by the expression, enabling semantically grounded supervision that strengthens video-text alignment. SAMDWICH leverages these aligned text-to-clip pairs to guide training, significantly enhancing referential understanding. Building upon this framework, we propose Moment-guided Dual-path Propagation (MDP), a moment-aware propagation strategy that improves both object grounding and tracking by training on both relevant and irrelevant frames through a moment-centric memory mechanism. In addition, we introduce Object-level Selective Supervision (OSS), an object-level filtering strategy that supervises only the objects temporally aligned with the expression in each training clip. This selective supervision reduces semantic noise and reinforces language-conditioned learning. Extensive experiments show that SAMDWICH achieves state-of-the-art performance on the challenging MeViS benchmark, particularly excelling in complex scenarios involving diverse expressions.
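
The object-level filtering idea behind OSS can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says that objects "temporally aligned with the expression in each training clip" are supervised, so the overlap rule used here (an object is kept if its annotated moment covers at least one sampled frame) is an assumption, and the names `ObjectMoment` and `select_supervised_objects` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectMoment:
    """Hypothetical annotation: inclusive frame range in which the
    expression refers to this object."""
    object_id: int
    start_frame: int
    end_frame: int


def select_supervised_objects(moments, clip_frames):
    """Return ids of objects whose moment covers at least one clip frame.

    Assumed alignment rule; the paper may use a different criterion.
    """
    selected = []
    for m in moments:
        if any(m.start_frame <= f <= m.end_frame for f in clip_frames):
            selected.append(m.object_id)
    return selected


# Two annotated objects, referred to at different times in the video.
moments = [
    ObjectMoment(object_id=1, start_frame=0, end_frame=20),
    ObjectMoment(object_id=2, start_frame=40, end_frame=60),
]

# A sampled training clip, given as frame indices.
clip_frames = [10, 15, 25]

# Only object 1 is referred to during this clip, so only it is supervised.
print(select_supervised_objects(moments, clip_frames))  # [1]
```

Under this assumed rule, an object that is visible in the clip but whose moment lies elsewhere in the video contributes no loss for that clip, which is one way to read the "reduces semantic noise" claim.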
