
SeqFormer: Sequential Transformer for Video Instance Segmentation

Published 15 Dec 2021 in cs.CV (arXiv:2112.08275v2)

Abstract: In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the principle of vision transformer that models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms shall be done with each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. Such achievement significantly exceeds the previous state-of-the-art performance by 4.6 and 4.4, respectively. In addition, integrated with the recently-proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer could be a strong baseline that fosters future research in video instance segmentation, and in the meantime, advances this field with a more robust, accurate, neat model. The code is available at https://github.com/wjf5203/SeqFormer.

Citations (85)

Summary

  • The paper introduces a novel transformer method that unifies detection, segmentation, and tracking for video instance segmentation.
  • It employs unique frame-level query decomposition, achieving 47.4 AP with ResNet-50 and 59.3 AP with a Swin transformer backbone.
  • The approach streamlines VIS processing by eliminating complex post-processing while delivering robust real-time performance at 72.3 FPS.

SeqFormer: Sequential Transformer for Video Instance Segmentation

The paper presents SeqFormer, a novel approach to video instance segmentation (VIS) rooted in the principles of vision transformers. Video instance segmentation combines detection, classification, segmentation, and tracking of objects in videos, posing heightened challenges compared to static image instance segmentation due to the necessity of maintaining temporal coherence across frames.

Core Approach

SeqFormer integrates detection and tracking within a single transformer architecture. The key innovation lies in treating video-level instances holistically rather than independently across frames. Unlike traditional VIS methods, which either extend static image segmentation models with tracking branches or segment instances over entire video clips, SeqFormer uses a single instance query per object while applying attention to each frame independently.

The framework of SeqFormer includes:

  • Backbone and Encoder: Utilizes a CNN backbone to extract feature maps for each frame independently before passing them through a transformer encoder.
  • Query Decomposition: Decomposes each instance-level query into frame-level box queries, so attention can focus on the instance's spatial region in every frame and refine features in a coarse-to-fine manner.
  • Output Heads: Integration of mask, box, and class heads for comprehensive instance segmentation, mask sequence generation, and classification.
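The query decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the shapes, the single shared query, and the plain dot-product attention are stand-ins for the model's actual decoder layers.

```python
import numpy as np

rng = np.random.default_rng(0)
T, HW, C = 5, 64, 16                       # frames, flattened spatial positions, channels
frame_feats = rng.normal(size=(T, HW, C))  # encoder features, one map per frame

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# One instance query per object, shared across the whole clip...
instance_query = rng.normal(size=(C,))
box_queries = np.tile(instance_query, (T, 1))  # (T, C) frame-level box queries

# ...but attention runs on each frame independently, so the query can lock
# onto the instance wherever it has moved within that frame.
attn = softmax(box_queries[:, None, :] @ frame_feats.transpose(0, 2, 1) / np.sqrt(C))
box_embeds = (attn @ frame_feats).squeeze(1)   # (T, C) refined frame-level embeddings
```

Each row of `box_embeds` localizes the same instance in one frame; the temporal aggregation into a video-level representation happens downstream.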

Empirical Performance

On the YouTube-VIS benchmarks (2019 and 2021), SeqFormer notably outperforms existing models:

  • Achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with ResNet-101, exceeding the prior state of the art by 4.6 and 4.4 AP, respectively.
  • A Swin transformer backbone raises performance further to 59.3 AP, showing that the gains hold across backbone choices.

Because the model takes the entire video as input yet still runs at 72.3 FPS, it remains practical in scenarios where both accuracy and speed matter.

Contributions and Implications

SeqFormer's decomposed attention reflects the asymmetry between the spatial and temporal dimensions of video, challenging the practice of treating them interchangeably. The paper also introduces a weighted feature aggregation method that learns which frames contribute valuable instance information, improving the quality of the video-level instance representation.
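A hedged NumPy sketch of such weighted aggregation follows; the scorer here is a random projection standing in for the learned weighting, and all shapes are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, C, HW = 5, 16, 64
box_embeds = rng.normal(size=(T, C))       # frame-level instance embeddings
frame_feats = rng.normal(size=(T, HW, C))  # per-frame encoder features

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A scorer rates how much each frame contributes to the video-level
# representation (random projection here, learned in the real model).
w_score = rng.normal(size=(C,))
frame_weights = softmax(box_embeds @ w_score)   # (T,), sums to 1

# Weighted temporal aggregation into one video-level instance embedding.
video_embed = frame_weights @ box_embeds        # (C,)

# The video-level embedding conditions a (simplified) dynamic mask head
# that is applied to every frame, yielding one mask sequence per instance.
mask_logits = frame_feats @ video_embed         # (T, HW)
```

Frames where the instance is occluded or blurred receive low weights, so they contribute little to the aggregated embedding, which is the intuition behind the paper's weighting scheme.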

Moreover, SeqFormer removes the need for complex post-processing or heuristic tracking mechanisms, fostering cleaner and more efficient model designs. The publicly available code positions SeqFormer as a strong baseline for subsequent VIS research.
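Because each query produces one mask sequence, a track is simply the set of masks sharing a query index; no matching step is needed at inference. A minimal illustration (array shapes and the score threshold are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
Q, T, H, W = 3, 4, 8, 8
mask_logits = rng.normal(size=(Q, T, H, W))  # per-query mask sequences
scores = np.array([0.9, 0.2, 0.7])           # per-query classification scores

# A query's index is its identity: masks at the same index across frames
# form one track, so tracking falls out with no association heuristics.
keep = scores > 0.5
tracks = {q: (mask_logits[q] > 0).astype(np.uint8) for q in np.flatnonzero(keep)}
```

Each surviving entry in `tracks` is a binary mask sequence of shape `(T, H, W)` for one instance across the whole clip.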

Future Directions

SeqFormer's approach resonates with the broader trajectory of integrating transformers into diverse areas of computer vision. Future explorations could involve refining the model's capability to distinguish overlaps in dense video sequences or expanding its applicability to real-time processing requirements, possibly integrating lightweight architectures or temporal coherency learning.

In conclusion, SeqFormer represents a significant evolution in applying transformer-based architectures to video instance segmentation, contributing valuable insights and metrics that propel understanding and development within this domain.
