What is Point Supervision Worth in Video Instance Segmentation? (2404.01990v1)
Abstract: Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos. Conventional VIS methods rely on densely annotated object masks, which are expensive to obtain. We reduce the human annotations to only one point for each object in a video frame during training, and obtain high-quality mask predictions close to those of fully supervised models. Our proposed training method consists of a class-agnostic proposal generation module to provide rich negative samples and a spatio-temporal point-based matcher to match the object queries with the provided point annotations. Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
- Shuaiyi Huang (12 papers)
- De-An Huang (45 papers)
- Zhiding Yu (94 papers)
- Shiyi Lan (38 papers)
- Subhashree Radhakrishnan (7 papers)
- Jose M. Alvarez (90 papers)
- Abhinav Shrivastava (120 papers)
- Anima Anandkumar (236 papers)