
TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (2312.06630v3)

Published 11 Dec 2023 in cs.CV

Abstract: Training on large-scale datasets can boost the performance of video instance segmentation (VIS), but annotated VIS datasets are hard to scale up due to their high labor cost. What we possess instead are numerous isolated field-specific datasets; it is therefore appealing to jointly train models across an aggregation of datasets to enhance data volume and diversity. However, because the category spaces are heterogeneous, naively combining multiple datasets dilutes the attention of models across different taxonomies: mask precision improves with data volume, while classification suffers. Thus, increasing the data scale and enriching the taxonomy space while maintaining classification precision is important. In this work, we analyze how providing extra taxonomy information helps models concentrate on specific taxonomies, and propose Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) to address this challenge. Specifically, we design a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder. We conduct extensive experimental evaluations on four popular and challenging benchmarks: YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvement over the baseline solutions and sets new state-of-the-art records on all four benchmarks, demonstrating the effectiveness and generality of our approach. The code is available at https://github.com/rkzheng99/TMT-VIS .
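The abstract describes injecting taxonomy priors into instance queries before the transformer decoder. The sketch below illustrates one plausible form of that second aggregation stage: instance queries attend over a compiled set of taxonomy embeddings, and the attention-weighted prior is added back to the queries. The function name, shapes, and the residual-injection design are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_taxonomy_priors(queries, taxonomy_emb):
    """Enrich instance queries with taxonomy priors via one
    cross-attention step (queries attend over taxonomy embeddings).

    queries:      (num_queries, d) instance queries
    taxonomy_emb: (num_classes, d) taxonomy embeddings compiled
                  from the input video's dataset vocabulary
    Returns queries with an attention-weighted taxonomy prior added.
    """
    d = queries.shape[-1]
    scores = queries @ taxonomy_emb.T / np.sqrt(d)   # (Q, C)
    attn = softmax(scores, axis=-1)                  # rows sum to 1
    prior = attn @ taxonomy_emb                      # (Q, d)
    return queries + prior                           # residual injection

# Toy example: 5 queries, 10 taxonomy entries, dim 16.
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 16))
t = rng.standard_normal((10, 16))
out = aggregate_taxonomy_priors(q, t)
print(out.shape)  # (5, 16)
```

In a real model the attention would use learned projections and run per dataset vocabulary, so that queries for a video from, say, OVIS are biased toward OVIS categories before decoding.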
