Motion-Guided Few-Shot Video Segmentation
- The paper introduces DMA, a decoupled motion-appearance network that leverages temporal dynamics for few-shot video object segmentation, achieving up to a 4.7% mean-IoU improvement.
- It establishes the MOVE benchmark with 224 motion categories and 4,300 video clips, emphasizing motion-centric segmentation over static appearance.
- Experimental results validate DMA's robust performance across overlapping and non-overlapping motion splits, underscoring the significance of motion-guided segmentation.
Motion-guided few-shot video object segmentation (FSVOS) addresses the problem of segmenting dynamic objects in video sequences based on a small number of annotated examples that share the same motion patterns, focusing on temporally coherent behaviors rather than static object categories. Unlike traditional FSVOS and few-shot segmentation tasks that emphasize object appearance and category, motion-guided FSVOS explicitly leverages temporal dynamics, enabling the segmentation of objects performing specific actions, regardless of their class or appearance (Ying et al., 29 Jul 2025).
1. Dataset Foundations: MOVE Benchmark
The MOVE dataset constitutes a foundational resource for motion-guided FSVOS research. MOVE comprises 224 motion categories distributed across four domains (daily actions, sports, entertainment activities, and special actions), with 4,300 video clips and 261,920 total frames. A total of 5,135 moving object instances are annotated, yielding 314,619 pixel-level masks, each refined by annotators with the help of SAM2.
Each sample in MOVE includes a brief support video, depicting a prototypical instance of a target motion, and a query video for which per-frame binary segmentation is required. This design compels models to identify and segment by motion, not by static category. Binary masks are provided for every frame. Compared to prior datasets such as YouTube-VIS, FSVOD-500, and MiniVSPW—which are object-centric and lack motion supervision—MOVE provides motion-centric, video-level supports and explicitly enforces segmentation based on temporal dynamics rather than appearance (Tab. 1 in (Ying et al., 29 Jul 2025)).
The experimental protocol relies on 4-fold cross-validation under two split strategies: the Overlapping Split (OS), which allows partial overlap of parent motion categories between training and testing, and the Non-Overlapping Split (NS), which ensures no shared parent nodes. Episodes follow an N-way, K-shot structure: the support set pairs each support video with its mask sequence, while the query set contains videos whose ground-truth masks are unknown to the model and must be predicted. The primary experimental regimes are 2-way-1-shot and 5-way-1-shot, using 240,000 training and 20,000 testing episodes per fold. Model backbones include ResNet-50 and VideoSwin-Tiny pretrained on ImageNet and Kinetics-400.
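The N-way, K-shot episode structure can be illustrated with a minimal sampling sketch. The helper `sample_episode`, the `clips_by_motion` layout, and the choice to draw a single query clip from one of the sampled motion categories are all assumptions made for illustration; none of them are taken from the MOVE codebase.

```python
import random
from typing import Dict, List, Optional


def sample_episode(
    clips_by_motion: Dict[str, List[str]],
    n_way: int = 2,
    k_shot: int = 1,
    rng: Optional[random.Random] = None,
) -> dict:
    """Draw one N-way, K-shot episode (hypothetical helper, not MOVE's loader).

    clips_by_motion maps a motion category name to a list of clip ids.
    Each sampled category contributes k_shot support clips. One query clip is
    drawn from one sampled category and kept disjoint from its support clips.
    """
    rng = rng or random.Random()
    # Keep only categories that can supply k_shot supports plus one query clip.
    eligible = [m for m, clips in clips_by_motion.items() if len(clips) >= k_shot + 1]
    if len(eligible) < n_way:
        raise ValueError("not enough motion categories with k_shot + 1 clips")

    motions = rng.sample(eligible, n_way)
    target = rng.choice(motions)  # category the query clip is drawn from (assumption)

    support = {}
    query_clip = None
    for motion in motions:
        picked = rng.sample(clips_by_motion[motion], k_shot + 1)
        support[motion] = picked[:k_shot]
        if motion == target:
            query_clip = picked[k_shot]  # held out, never part of the support set

    return {"support": support, "query": query_clip, "query_motion": target}


# Toy usage with made-up clip ids.
clips = {
    "drumming": ["d1", "d2", "d3"],
    "waving": ["w1", "w2"],
    "jumping": ["j1", "j2"],
}
episode = sample_episode(clips, n_way=2, k_shot=1, rng=random.Random(0))
```

In this sketch a 2-way-1-shot episode holds two support clips (one per motion) and one query clip; a model would be asked to segment the query once per support motion.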
2. Problem Formulation and Evaluation Protocols
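Motion-guided FSVOS formally requires learning a segmentation function $f_\theta$ that, conditioned on support video-mask pairs $\mathcal{S}$ and a query video $V^q$, outputs per-frame segmentation masks for object instances in $V^q$ exhibiting the motion delineated in $\mathcal{S}$. Each episode consists of a set of support video-mask tuples $\mathcal{S} = \{(V^s_k, M^s_k)\}_{k=1}^{K}$ and a temporally contiguous sequence of query frames $V^q = \{I^q_t\}_{t=1}^{T}$. With $M^q$ denoting the ground-truth query masks, the learning objective is to minimize the expected segmentation loss:
$$\min_\theta \; \mathbb{E}_{(\mathcal{S},\, V^q,\, M^q)} \Big[ \mathcal{L}_{seg}\big(f_\theta(V^q \mid \mathcal{S}),\, M^q\big) \Big]$$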
The evaluation employs several complementary metrics:
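- Jaccard index (mean-IoU, $\mathcal{J}$), the intersection over union of the predicted mask $\hat{M}$ and the ground-truth mask $M$:
$$\mathcal{J} = \frac{|\hat{M} \cap M|}{|\hat{M} \cup M|}$$
- Boundary F-score ($\mathcal{F}$), the harmonic mean of boundary precision $P_c$ and boundary recall $R_c$:
$$\mathcal{F} = \frac{2\, P_c\, R_c}{P_c + R_c}$$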
- T-Acc: Accuracy for non-empty (foreground) query samples
- N-Acc: Accuracy for background (empty-foreground) query samples
These metrics capture both spatial segmentation quality and temporal consistency across the query videos.
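The metrics above can be sketched in a few lines of NumPy. This is an illustrative approximation, not the benchmark's evaluation code: boundary extraction for the F-score is omitted (the function only takes precision and recall as inputs), the convention that two empty masks give an IoU of 1.0 is assumed, and T-Acc/N-Acc are implemented under the assumption that a prediction counts as correct when its emptiness matches that of the ground truth.

```python
import numpy as np


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of two binary masks; returns 1.0 when both are empty (assumed convention)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def mean_jaccard(pred_video: np.ndarray, gt_video: np.ndarray) -> float:
    """Average per-frame IoU over a (T, H, W) mask sequence."""
    return float(np.mean([jaccard(p, g) for p, g in zip(pred_video, gt_video)]))


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def target_accuracies(pred_videos, gt_videos):
    """T-Acc and N-Acc under the assumed definition.

    A query with a non-empty ground truth counts as a T-Acc hit if the
    prediction is non-empty; a query with an empty ground truth counts as an
    N-Acc hit if the prediction is empty.
    """
    t_hits = t_total = n_hits = n_total = 0
    for pred, gt in zip(pred_videos, gt_videos):
        has_target = bool(np.any(gt))
        predicted_target = bool(np.any(pred))
        if has_target:
            t_total += 1
            t_hits += predicted_target
        else:
            n_total += 1
            n_hits += not predicted_target
    t_acc = t_hits / t_total if t_total else float("nan")
    n_acc = n_hits / n_total if n_total else float("nan")
    return t_acc, n_acc
```

Separating T-Acc from N-Acc matters because a model that always predicts an empty mask would score perfectly on background queries while failing every foreground one.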
3. Decoupled Motion-Appearance Network (DMA) Architecture
The Decoupled Motion-Appearance Network (DMA) introduces a five-stage pipeline, structuring feature extraction and prototype learning to emphasize temporal motion cues alongside object appearance:
- Encoder and Proposal Generator: The encoder (a backbone with FPN) extracts multi-scale frame features at resolutions {1/4, 1/8, 1/16, 1/32}. The proposal generator predicts coarse masks on the query sequence using features upsampled from 1/32 to 1/8 scale with a lightweight convolutional head.
- DMA Module: The primary innovation is the explicit decoupling of appearance and motion. Appearance prototypes $P_a$ are computed via mask pooling over 1/4-resolution features, while motion prototypes $P_m$ are derived from temporally differenced features, $D_t = F_{t+1} - F_t$ (with $F_t$ the features of frame $t$), which are further processed using 3D convolutions and pooled spatially.
Auxiliary classifiers operate on temporally averaged appearance and motion prototypes, imposing cross-entropy losses $\mathcal{L}_{a}$ and $\mathcal{L}_{m}$ to encourage feature disentanglement. Transformer refinement leverages learnable queries and a [CLS] token, attending over prototype sequences via cross- and self-attention to yield support- and query-specific prototypes $P^s$ and $P^q$, together with the representations $[\mathrm{CLS}]^s$ and $[\mathrm{CLS}]^q$.
- Prototype Attention and Mask Decoder: Cross-attention and self-attention fuse $P^q$ (query) with $P^s$ (support), producing a fused prototype representation $\hat{P}$; a multi-scale mask decoder combines this representation with FPN features to deliver the final per-frame binary masks.
- Matching Score: Cosine similarity between the support and query [CLS] tokens quantifies motion correspondence: $s_{match} = \cos\big([\mathrm{CLS}]^s, [\mathrm{CLS}]^q\big)$.
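- Loss Functions: The total episode training loss is:
$$\mathcal{L} = \mathcal{L}_{mask} + \mathcal{L}_{prop} + \mathcal{L}_{a} + \mathcal{L}_{m} + \mathcal{L}_{match}$$
where $\mathcal{L}_{mask}$ and $\mathcal{L}_{prop}$ are combinations of cross-entropy and IoU losses for the final and proposal masks, $\mathcal{L}_{a}$ and $\mathcal{L}_{m}$ are auxiliary classification losses on the appearance and motion prototypes, and $\mathcal{L}_{match}$ is binary cross-entropy for the match score; any per-term weighting coefficients are omitted here.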
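The prototype extraction and matching steps in the pipeline above can be sketched with plain NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the 3D convolutions and transformer refinement are omitted, the forward difference between consecutive frames is an assumed choice, and the function names `masked_average_pool`, `temporal_difference`, and `cosine_similarity` are hypothetical.

```python
import numpy as np


def masked_average_pool(features: np.ndarray, masks: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Pool (T, C, H, W) features inside (T, H, W) binary masks -> (T, C) prototypes."""
    weights = masks[:, None, :, :].astype(features.dtype)  # (T, 1, H, W)
    pooled = (features * weights).sum(axis=(2, 3))          # (T, C)
    return pooled / (weights.sum(axis=(2, 3)) + eps)        # divide by mask area


def temporal_difference(features: np.ndarray) -> np.ndarray:
    """Frame-to-frame feature differences: (T, C, H, W) -> (T-1, C, H, W)."""
    return features[1:] - features[:-1]


def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


# Toy example: random features stand in for 1/4-resolution encoder outputs.
rng = np.random.default_rng(0)
T, C, H, W = 4, 8, 16, 16
support_feats = rng.normal(size=(T, C, H, W))
support_masks = np.zeros((T, H, W))
support_masks[:, 4:12, 4:12] = 1.0

# Appearance prototypes: one C-dimensional vector per frame, pooled inside the mask.
appearance_protos = masked_average_pool(support_feats, support_masks)  # (T, C)

# Motion prototypes: spatially pooled temporal differences (3D convs omitted here).
motion_protos = temporal_difference(support_feats).mean(axis=(2, 3))   # (T-1, C)

# Stand-in for the [CLS]-token match score: compare temporally averaged motion vectors.
query_feats = rng.normal(size=(T, C, H, W))
query_motion = temporal_difference(query_feats).mean(axis=(2, 3)).mean(axis=0)
match_score = cosine_similarity(motion_protos.mean(axis=0), query_motion)
```

The point of the sketch is the asymmetry between the two branches: appearance prototypes depend on what lies inside the mask in each frame, whereas motion prototypes vanish for a static scene because consecutive features cancel.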
4. Experimental Evaluation and Ablation Analysis
DMA is evaluated against six state-of-the-art baselines from FSVOS, image-based few-shot segmentation (FSS), and referring video object segmentation (RVOS) on MOVE, testing both ResNet-50 and VideoSwin-Tiny backbones under OS and NS splits with 2-way-1-shot and 5-way-1-shot episodes.
Quantitative Results (ResNet-50, OS split, 2-way-1-shot):
| Method | J (mean-IoU) |
|---|---|
| DANet | 45.4% |
| DMA | 50.1% |
DMA improves J by 4.7 percentage points over the best prior baseline. In 5-way-1-shot, J increases from 35.6% (best baseline) to 40.2% (DMA). The VideoSwin-Tiny backbone further raises DMA performance to 51.5% (2-way-1-shot) and 41.4% (5-way-1-shot). The NS split lowers overall J because parent motion categories do not overlap between training and testing, but DMA still achieves the highest score: 46.0% (DMA) vs. 44.6% (best baseline) in 2-way-1-shot.
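The reported gains are absolute differences in J, measured in percentage points. A short check using only the figures quoted above makes the arithmetic explicit; the dictionary layout is purely illustrative.

```python
# (best baseline J, DMA J) in percent, as quoted in the text above.
results = {
    ("OS", "2-way-1-shot"): (45.4, 50.1),
    ("OS", "5-way-1-shot"): (35.6, 40.2),
    ("NS", "2-way-1-shot"): (44.6, 46.0),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {
    setting: round(dma - baseline, 1)
    for setting, (baseline, dma) in results.items()
}

for (split, regime), gain in gains.items():
    print(f"{split} {regime}: +{gain} points J")
```

The margin on the NS split (1.4 points) is noticeably smaller than on the OS split (4.7 and 4.6 points), which is consistent with NS being the harder generalization setting.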
Ablation studies clarify the necessity of key components:
- Motion extractor: Mask pooling only (J=41.3%), mask-adapter (43.4%), frame differencing + Conv3D (46.8%).
- Prototype components: Appearance only (36.5%), motion only (43.8%), both (46.8%).
- Auxiliary heads: No aux heads (43.8%), only motion (44.2%), only object (43.5%), both (46.8%).
- Oracle upper bounds with ground-truth motion label: 63.6% J; ground-truth masks: 74.3% J.
Qualitative analysis illustrates that DMA leverages motion cues even in the presence of confounding object type or background: it can match “playing drums” regardless of the actor being a cat or person; correctly segments fine-grained motions like finger gestures (distinguishing temporal order); and is robust to misleading scenes (e.g., football played on a basketball court). t-SNE visualizations confirm that, with DMA, learned prototypes cluster by motion type rather than object class.
5. Comparison to Existing Approaches and Datasets
FSVOS and image-based FSS benchmarks provide support supervision at the image and object-category level (e.g., “panda”, “robot”), failing to model temporal motion dynamics. YouTube-VIS, FSVOD-500, and MiniVSPW are object-centric without explicit motion pattern modeling, limiting their utility in scenarios where action, not static class, defines the segmentation target. In contrast, MOVE uniquely incorporates motion-centric, video-level support and requires models to solve temporal correspondence between support and query sequences.
DMA's explicit separation of motion and appearance signals, paired with transformer-based prototype attention, yields significant improvements when the segmentation target is defined by motion type rather than object identity.
6. Significance and Research Implications
MOVE establishes the first large-scale benchmark targeting motion-guided FSVOS and demonstrates, through empirical and qualitative analyses, that such tasks cannot be robustly solved by current object-centric or appearance-based segmentation methods (Ying et al., 29 Jul 2025). The consistently higher performance of DMA, including on the more stringent NS split and on fine-grained actions, underscores the necessity of motion-centric representation learning.
The motion-appearance disentanglement and transformer-based matching modules open further avenues for research, including few-shot motion understanding, temporal representation learning, and robust segmentation in scenarios demanding action-level generalization. The results and analyses presented in MOVE provide a foundation for developing models that prioritize dynamic patterns over static appearance, directly benefiting downstream tasks in human behavior analysis, robotics, and activity-based video understanding.