Segment Anything Across Shots (SAAS)
- The paper introduces SAAS, a framework that overcomes shot transition challenges by integrating transition mimicking augmentation with a transition-aware segmentation architecture.
- It employs a novel Transition Detection Module and local memory banks to maintain object consistency across abrupt scene cuts.
- Benchmarks on YouMVOS and Cut-VOS demonstrate state-of-the-art performance with over 5% improvement in key segmentation metrics.
Segment Anything Across Shots (SAAS) is a methodological and architectural framework for robust object segmentation across video sequences that contain multiple shots and abrupt frame discontinuities. Unlike classic video object segmentation (VOS), which predominantly assumes single-shot sequences, SAAS directly addresses multi-shot semi-supervised video object segmentation (MVOS)—the task of tracking and segmenting objects of interest across shot boundaries, starting from an initial mask prompt. SAAS encompasses both a transition-mimicking data augmentation strategy and a transition-aware segmentation architecture, validated through the introduction of dedicated benchmarks for multi-shot video segmentation (Hu et al., 17 Nov 2025).
1. Motivation and Problem Statement
Standard VOS methods are optimized for single-shot settings in which frames are temporally coherent and changes in appearance unfold gradually. In contrast, real-world videos—including movies and edited content—are composed of multiple shots, separated by abrupt transitions (“cuts”), causing objects to reappear with dramatic shifts in scale, illumination, or view. This context introduces two critical challenges:
- Shot Transition Discontinuity: Memory-based propagation fails at cut points due to absent temporal continuity, resulting in segmentation drift or identity loss.
- Scarcity of Labeled Multi-Shot Data: Annotated multi-shot segmentation datasets are rare, limiting the training of models capable of generalizing across shot boundaries.
SAAS was developed to overcome these limitations through the combination of transition-aware augmentation and an architecture explicitly designed to detect, comprehend, and bridge shot transitions (Hu et al., 17 Nov 2025).
2. Transition Mimicking Augmentation (TMA)
Transition Mimicking Augmentation (TMA) is a stochastic data transformation pipeline that synthesizes artificial shot transitions within single-shot video clips, enabling the model to develop cross-shot generalization abilities even in the absence of true multi-shot training data. TMA proceeds as follows:
- Given a clip of eight frames sampled from a single-shot VOS dataset, TMA injects anywhere from zero to several synthetic transitions, each triggered with a fixed transition probability.
- For each transition, stochastic parameters are sampled to control its characteristics: whether a cut is applied, whether the replacement frames come from the same video or a different one, the foreground copy rate, and whether a horizontal flip is applied.
- If a transition is applied, one or more frames are replaced by sampled frames (with or without masks) from another clip, optionally with geometric augmentations (moderate/strong affine transformations).
- In the “copy foreground” operation, the tracked object's foreground is transplanted from the original frame into the replacement frame using its annotated mask, ensuring varied but realistic reappearances of the object.
- When no shot cut is introduced, strong augmentations are instead applied to maintain data variability.
Optimal values for these augmentation hyperparameters were determined empirically (Hu et al., 17 Nov 2025).
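The following is a minimal, illustrative sketch of the TMA sampling logic described above. It assumes a clip is a list of (frame, mask) NumPy arrays of identical spatial size; the parameter names and default probabilities are placeholders, not the hyperparameters reported in the paper.

```python
import random
import numpy as np

def strong_augment(clip):
    """Placeholder for the strong affine/photometric augmentation branch (not shown)."""
    return clip

def transition_mimicking_augmentation(
    clip,            # list of (frame, mask) NumPy arrays from a single-shot clip
    donor_clip,      # another clip to draw replacement frames from
    p_trans=0.5,     # chance of injecting a synthetic shot transition (illustrative)
    p_same=0.5,      # chance the "new shot" is sampled from the same video (illustrative)
    p_copy=0.5,      # foreground copy rate (illustrative)
    p_flip=0.5,      # horizontal-flip probability (illustrative)
):
    """Inject a synthetic shot transition into a single-shot training clip."""
    clip = list(clip)
    if random.random() >= p_trans:
        # No synthetic cut: apply strong augmentation instead to keep data varied.
        return strong_augment(clip)

    cut = random.randint(1, len(clip) - 1)  # index of the first frame after the cut
    source = clip if random.random() < p_same else donor_clip
    for i in range(cut, len(clip)):
        new_frame, new_mask = random.choice(source)
        new_frame, new_mask = new_frame.copy(), new_mask.copy()
        if random.random() < p_copy:
            # "Copy foreground": transplant the tracked object into the new shot
            # so it reappears in a different context after the cut
            # (assumes all frames share the same spatial size).
            old_frame, old_mask = clip[i]
            fg = old_mask.astype(bool)
            new_frame[fg] = old_frame[fg]
            new_mask = old_mask.copy()
        if random.random() < p_flip:
            new_frame = np.flip(new_frame, axis=1).copy()
            new_mask = np.flip(new_mask, axis=1).copy()
        clip[i] = (new_frame, new_mask)
    return clip
```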
3. SAAS Model Architecture
The SAAS model is a transition-aware extension of a promptable segmentation backbone (SAM2 with a Hiera-MAE encoder), incorporating several specialized modules and memory banks to handle multi-shot dynamics:
- Transition Detection Module (TDM): Uses a dilated-convolutional feature pyramid followed by a sigmoid activation to produce a per-frame transition score; when the score exceeds a fixed threshold, a shot transition is detected and transition-aware processing is triggered.
- Transition Comprehension Module (TCH): When a transition is flagged, this module encodes a transition-state embedding via multi-block cross-attention over the current frame's feature maps, the previous adjacent memory, and the scene-level memory. Auxiliary objectives on this embedding supervise object existence (a binary cross-entropy loss) and spatial localization (a mean-squared-error loss).
- Local Memory Bank: Extracts high-resolution patch features for local object regions via minimum-spanning-tree clustering on the reference frame's mask, enabling robust re-identification of the object among similar distractors.
- Integration: The segmentation head (SAM2 decoder) is fed a concatenation of conditional, adjacent, scene, and local memory features, and receives mask or point prompts propagated across shots. Losses include cross-entropy, focal, dice, IoU, plus the auxiliary transition-aware objectives.
Memory management spans four banks: conditional, adjacent, scene-level, and local. The architectural design and attention-based memory fusion are critical for enabling both within-shot temporal coherence and across-shot re-identification (Hu et al., 17 Nov 2025).
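As a schematic illustration of how the memory banks and transition-aware modules might interact at inference time, the sketch below uses placeholder callables for the encoder, TDM, TCH, and decoder; the actual memory-fusion and reset policies of the released model are more involved.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class MemoryBanks:
    """The four memory banks described above (contents are purely schematic)."""
    conditional: List[Any] = field(default_factory=list)  # prompt-frame features
    adjacent: List[Any] = field(default_factory=list)     # recent in-shot frames
    scene: List[Any] = field(default_factory=list)        # per-shot summary features
    local: List[Any] = field(default_factory=list)        # high-res object patch features

def segment_across_shots(frames,
                         encode: Callable,      # image encoder (e.g. the Hiera backbone)
                         tdm_score: Callable,   # TDM: features -> transition score in [0, 1]
                         comprehend: Callable,  # TCH: (features, banks) -> transition embedding
                         decode: Callable,      # decoder: (features, banks, embedding) -> mask
                         threshold: float = 0.5):  # illustrative transition threshold
    """Schematic per-frame loop: detect cuts, then segment with the appropriate memories."""
    banks = MemoryBanks()
    masks = []
    for frame in frames:
        feats = encode(frame)
        transition_embedding = None
        if tdm_score(feats) > threshold:
            # Cut detected: adjacent memory is stale, so re-identify the object
            # from scene-level and local memory via the comprehension module.
            transition_embedding = comprehend(feats, banks)
            banks.adjacent.clear()
        mask = decode(feats, banks, transition_embedding)
        banks.adjacent.append((feats, mask))
        masks.append(mask)
    return masks
```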
4. Datasets and Benchmarking
To facilitate systematic evaluation of SAAS techniques, the Cut-VOS benchmark was introduced, offering:
- 100 videos, totaling approximately 1590 seconds, 10.2K pixel-precise object masks, 174 objects across 11 coarse and 40 fine-grained categories, and 648 annotated shot transitions.
- An average of 11.3 shot segments per video and a transition frequency of 0.346/s, significantly higher than the earlier YouMVOS benchmark.
- Exhaustive annotation of various transition types, including cut-in, cut-away, delayed cut-in, view changes, and occlusion-driven transitions.
Metrics include region similarity (J), contour accuracy (F), and cross-shot region accuracy (J_t), the last measuring the model's ability to carry object masks across shot boundaries.
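For illustration, a minimal sketch of the region-similarity metric is given below; the cross-shot variant is computed here only on frames that follow an annotated transition, which is an assumed reading of J_t rather than the benchmark's exact definition.

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(pred, gt).sum()) / float(union)

def cross_shot_region_accuracy(preds, gts, post_cut_frames) -> float:
    """Assumed J_t: mean J restricted to frames immediately following a shot transition."""
    scores = [region_similarity(preds[t], gts[t]) for t in post_cut_frames]
    return float(np.mean(scores)) if scores else float("nan")
```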
| Dataset | Videos | Objects | Masks | Shots | Trans./s | Categories |
|---|---|---|---|---|---|---|
| YouMVOS† | 30* | 78* | 64.6K* | 2.4K | 0.222 | 4 |
| Cut-VOS | 100 | 174 | 10.2K | 648 | 0.346 | 11 |
*Indicates stats for the YouMVOS† test split. (Hu et al., 17 Nov 2025)
5. Quantitative Results and Ablations
On both YouMVOS and Cut-VOS, SAAS achieves state-of-the-art performance compared to prior and contemporary baselines, including XMem, DEVA, Cutie, and SAM2 (Hu et al., 17 Nov 2025).
| Method | Params (M) | FPS | YouMVOS (J/F/J_t) | Cut-VOS (J/F/J_t) |
|---|---|---|---|---|
| XMem | 62.2 | 45 | 61.7 / 62.1 / 61.9 | 48.4 / 51.4 / 49.9 |
| DEVA | 61.2 | 37 | 63.3 / 64.5 / 63.9 | 47.3 / 50.8 / 49.1 |
| Cutie | 35.0 | 40 | 67.3 / 68.1 / 67.7 | 51.0 / 53.6 / 52.3 |
| SAM2-B+ | 80.9 | 22 | 67.6 / 67.6 / 67.6 | 54.0 / 56.4 / 55.2 |
| Cutie+TMA | 35.0 | 40 | 69.1 / 70.0 / 69.6 | 52.0 / 55.0 / 53.5 |
| SAAS-B+ | 92.5 | 21 | 73.4 / 73.7 / 73.5 | 59.4 / 61.9 / 60.7 |
| SAAS-L | 235.6 | 14 | 74.0 / 74.4 / 74.2 | 60.5 / 63.6 / 62.0 |
SAAS-B+ surpasses SAM2-B+ by 5.8% in region similarity (J) and 5.5% in cross-shot region accuracy (J_t) on YouMVOS, and maintains a gain of over 5% on Cut-VOS.
Ablation studies isolate the independent contributions of TMA, the local memory bank, and the transition comprehension module: TMA alone yields a +2.8% gain, local memory alone +2.4%, the combination of TMA and transition comprehension +4.9%, and the full SAAS integration +5.5% (Hu et al., 17 Nov 2025).
6. Architectural Positioning and Relation to SAM2
Segment Anything Across Shots (SAAS) builds directly on the streaming-memory, promptable segmentation backbone of SAM2 (Ravi et al., 1 Aug 2024). The core innovations are:
- Promptable segmentation: Accepting minimal user input (point, box, or mask) on the initial frame, using a hierarchical ViT encoder and streaming attention-based memory for propagation.
- Multi-bank memory: SAAS adds explicit handling of different temporal contexts (adjacent, scene, conditional, local), in contrast to the bounded FIFO memory of SAM2.
- Transition-specific processing: Incorporation of modules for transition detection and comprehension, enabling the model to recognize abrupt scene changes and to recover object identity after a cut—capabilities not present in base SAM2.
A plausible implication is that SAAS methodologies could be generalized to other promptable segmentation architectures with memory, provided sufficient performance on transition detection and memory reset policies.
7. Practical Implementation and Limitations
SAAS training proceeds in two phases: (I) the SAM2 backbone is frozen while the TDM is trained on dedicated shot-boundary datasets, and (II) the full model is trained end-to-end on TMA-augmented VOS data. Training uses AdamW with a decaying learning-rate schedule and a batch size of 8 clips per GPU, and inference runs at real-time speeds (SAAS-B+ at 21 FPS).
At inference, frames are resized to a minimum side length of 480 pixels, with segmentation masks post-processed via linear upsampling. The system is robust to abrupt transitions, high shot frequency, and crowded, distractor-rich scenes, though limitations remain in tracking through long occlusions or drastic changes without prompt re-anchoring.
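A minimal sketch of this inference-time pre- and post-processing is shown below, assuming OpenCV-style arrays; the interpolation modes and the 0.5 mask threshold are illustrative choices, not quoted from the released code.

```python
import cv2
import numpy as np

def resize_min_side(frame: np.ndarray, min_side: int = 480) -> np.ndarray:
    """Resize so that the shorter image side equals `min_side`, preserving aspect ratio."""
    h, w = frame.shape[:2]
    scale = min_side / min(h, w)
    return cv2.resize(frame, (round(w * scale), round(h * scale)),
                      interpolation=cv2.INTER_LINEAR)

def upsample_mask(mask: np.ndarray, out_hw) -> np.ndarray:
    """Linearly upsample a low-resolution mask to the original frame size and binarize."""
    h, w = out_hw
    prob = cv2.resize(mask.astype(np.float32), (w, h), interpolation=cv2.INTER_LINEAR)
    return (prob > 0.5).astype(np.uint8)  # 0.5 threshold is illustrative
```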
SAAS releases both models and the Cut-VOS benchmark, enabling further research into generalized cross-shot video object segmentation (Hu et al., 17 Nov 2025).