Segment Anything Meets Point Tracking (2307.01197v2)
Abstract: The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based inference. While click and brush interactions are well explored in interactive image segmentation, existing methods for videos focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-centric interactive video segmentation that combines SAM with long-term point tracking. SAM-PT leverages robust, sparse point selection and propagation techniques for mask generation. In contrast to traditional object-centric mask propagation strategies, it uses point propagation to exploit local structure information that is agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. Experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and more efficient interactions. We release code that integrates different point trackers and video segmentation benchmarks at https://github.com/SysCV/sam-pt.
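The "robust and sparse point selection" mentioned in the abstract refers to sampling query points from the first-frame mask before propagating them with a point tracker; the reference list below cites a K-Medoids clustering algorithm for this kind of sampling. As an illustrative sketch only (the function name and toy mask are my own, not from the SAM-PT codebase), a minimal K-Medoids sampler over mask pixels could look like:

```python
import numpy as np

def kmedoids_query_points(mask, k=4, iters=10, seed=0):
    """Select k query points inside a binary mask with a simple
    alternating K-Medoids over the foreground pixel coordinates."""
    coords = np.argwhere(mask)                      # (N, 2) foreground pixels
    rng = np.random.default_rng(seed)
    medoids = coords[rng.choice(len(coords), size=k, replace=False)]
    for _ in range(iters):
        # assign every foreground pixel to its nearest medoid
        dists = np.linalg.norm(coords[:, None] - medoids[None, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            cluster = coords[labels == j]
            if len(cluster) == 0:
                continue  # empty cluster: keep the old medoid
            # new medoid = cluster member minimizing total distance to the rest
            pairwise = np.linalg.norm(cluster[:, None] - cluster[None, :], axis=-1)
            medoids[j] = cluster[pairwise.sum(axis=1).argmin()]
    return medoids                                  # (k, 2), always on-mask pixels

# toy first-frame mask with two separate blobs
mask = np.zeros((32, 32), dtype=bool)
mask[4:10, 4:10] = True
mask[20:28, 18:30] = True
query_pts = kmedoids_query_points(mask, k=4)
```

Unlike k-means centroids, medoids are guaranteed to lie on actual mask pixels, which matters when the points are later handed to a point tracker and used as positive prompts for SAM on each frame.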
- Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.
- The 2018 DAVIS challenge on video object segmentation. arXiv:1803.00557, 2018.
- The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation. arXiv:1905.00737, 2019.
- Emerging properties in self-supervised vision transformers. In ICCV, 2021.
- Mask2Former for video instance segmentation. arXiv preprint arXiv:2112.10764, 2021.
- Pointly-supervised instance segmentation. In CVPR, pages 2617–2626, 2022.
- XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. In ECCV, 2022.
- Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. In CVPR, 2021.
- Tracking anything with decoupled video segmentation. In ICCV, 2023.
- Segment and track anything. arXiv preprint arXiv:2305.06558, 2023b.
- SuperPoint: Self-supervised interest point detection and description. In CVPRW, 2018.
- MOSE: A new dataset for video object segmentation in complex scenes. arXiv preprint arXiv:2302.01872, 2023.
- TAP-Vid: A benchmark for tracking any point in a video. In NeurIPS, 2022.
- TAPIR: Tracking any point with per-frame initialization and temporal refinement. In ICCV, 2023.
- Particle video revisited: Tracking through occlusions using point trajectories. In ECCV, 2022.
- Interactive video object segmentation using global and local transfer modules. In ECCV, 2020.
- Space-time correspondence as a contrastive random walk. In NeurIPS, 2020.
- CoTracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023.
- Segment anything in high quality. In NeurIPS, 2023.
- Segment anything. In ICCV, pages 4015–4026, 2023.
- Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pages 951–958. IEEE, 2009.
- Recurrent dynamic embedding for video object segmentation. In CVPR, 2022.
- Query-memory re-aggregation for weakly-supervised video object segmentation. In AAAI, 2021.
- David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004.
- An iterative image registration technique with an application to stereo vision. In IJCAI, 1981.
- Memory aggregation networks for efficient interactive video object segmentation. In CVPR, 2020.
- Fast user-guided video object segmentation by interaction-and-propagation networks. In CVPR, 2019.
- A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2, Part 2):3336–3341, 2009.
- The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Particle video: Long-range motion estimation using point trajectories. IJCV, 80:72–91, 2008.
- SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020.
- Jianbo Shi and Carlo Tomasi. Good features to track. In CVPR, 1994.
- RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.
- Detection and tracking of point features. IJCV, 9:137–154, 1991.
- Interactive video cutout. ACM ToG, 2005.
- Fast online object tracking and segmentation: A unifying approach. In CVPR, 2019.
- Tracking everything everywhere all at once. In ICCV, 2023.
- Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, 2021.
- Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023.
- SegGPT: Segmenting everything in context. In ICCV, 2023.
- SeqFormer: A frustratingly simple model for video instance segmentation. In ECCV, 2022.
- YouTube-VOS: A large-scale video object segmentation benchmark, 2018.
- Self-supervised video object segmentation by motion grouping. In ICCV, 2021.
- Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023.
- Decoupling features in hierarchical propagation for video object segmentation. In NeurIPS, 2022.
- LIFT: Learned invariant feature transform. In ECCV, 2016.
- BDD100K: A diverse driving dataset for heterogeneous multitask learning. In CVPR, 2020.
- Robust online video instance segmentation with track queries. arXiv preprint arXiv:2211.09108, 2022.
- Faster segment anything: Towards lightweight SAM for mobile applications. arXiv preprint arXiv:2306.14289, 2023.
- PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In ICCV, 2023.
- Frano Rajič
- Lei Ke
- Yu-Wing Tai
- Chi-Keung Tang
- Martin Danelljan
- Fisher Yu