Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization (2312.17686v2)
Abstract: Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. Single-stage methods, on the other hand, target both tasks by devoting part of the network (generally the backbone) to sharing most of the workload, trading performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, are sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur substantial complexity. In this paper, we observe that \textbf{a bipartite matching loss can be applied directly to the output tokens of a vision transformer}. This yields a backbone + MLP architecture that handles both tasks without the need for an extra encoder-decoder head or learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our Bipartite-Matching Vision Transformer model, \textbf{BMViT}, achieves +3 mAP on AVA2.2 w.r.t. the two-stage MViTv2-S counterpart. Code is available at \href{https://github.com/IoannaNti/BMViT}{https://github.com/IoannaNti/BMViT}
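The core idea of the abstract, a DETR-style bipartite matching loss applied directly to backbone output tokens with plain MLP heads for boxes and actions, can be sketched in a few lines of PyTorch. The sketch below is a simplified, single-label illustration, not the authors' implementation: AVA actions are actually multi-label, the DETR line of work also uses a GIoU box cost, and the head sizes, cost weights, and "no person" class handling here are assumptions.

```python
# Minimal sketch of a bipartite-matching loss applied directly to the output
# tokens of a vision transformer backbone. Assumptions: single-label actions
# (AVA is multi-label), an L1-only box cost (GIoU is also common), and
# illustrative head sizes / cost weights.
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


class TokenHeads(nn.Module):
    """Plain MLP heads on backbone output tokens: no decoder, no queries."""

    def __init__(self, dim: int, num_actions: int):
        super().__init__()
        self.box_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 4))       # (cx, cy, w, h)
        self.act_mlp = nn.Linear(dim, num_actions + 1)        # +1 = "no person"

    def forward(self, tokens):                 # tokens: (B, N, dim)
        return self.box_mlp(tokens).sigmoid(), self.act_mlp(tokens)


@torch.no_grad()
def hungarian_match(boxes, logits, gt_boxes, gt_labels):
    """One-to-one token-to-person assignment for a single clip."""
    prob = logits.softmax(-1)                          # (N, C+1)
    cost_cls = -prob[:, gt_labels]                     # (N, M)
    cost_box = torch.cdist(boxes, gt_boxes, p=1)       # (N, M)
    cost = (cost_cls + 5.0 * cost_box).cpu().numpy()   # weight is an assumption
    rows, cols = linear_sum_assignment(cost)           # Kuhn's Hungarian method
    device = boxes.device
    return (torch.as_tensor(rows, device=device),
            torch.as_tensor(cols, device=device))


def matching_loss(boxes, logits, gt_boxes, gt_labels, no_person_idx):
    """Matched tokens regress their box and action; all other tokens are
    pushed toward the 'no person' class, as in DETR-style set prediction."""
    rows, cols = hungarian_match(boxes, logits, gt_boxes, gt_labels)
    target = torch.full((logits.shape[0],), no_person_idx,
                        dtype=torch.long, device=logits.device)
    target[rows] = gt_labels[cols]
    loss_cls = F.cross_entropy(logits, target)
    loss_box = F.l1_loss(boxes[rows], gt_boxes[cols])
    return loss_cls + 5.0 * loss_box
```

Because the assignment is computed under `torch.no_grad()`, gradients flow only through the loss terms, not through the matching itself; this mirrors DETR's training recipe while dispensing with its encoder-decoder head and learnable queries.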
- End-to-end object detection with transformers. In ECCV, 2020.
- Efficient video action detection with token dropout and context refinement. In ICCV, 2023.
- Watch only once: An end-to-end video action detection framework. In ICCV, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Multiscale vision transformers. In ICCV, 2021.
- SlowFast networks for video recognition. In ICCV, 2019.
- Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
- Video action transformer network. In CVPR, 2019.
- AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
- Masked autoencoders are scalable vision learners. In CVPR, 2022.
- Mask R-CNN. In ICCV, 2017.
- Towards understanding action recognition. In ICCV, 2013.
- Action tubelet detector for spatio-temporal action localization. In ICCV, 2017.
- The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- You only watch once: A unified CNN architecture for real-time spatiotemporal action localization. arXiv preprint arXiv:1911.06644, 2019.
- Harold W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1–2):83–97, March 1955.
- The AVA-Kinetics localized human actions video dataset. arXiv preprint arXiv:2005.00214, 2020.
- Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023.
- MViTv2: Improved multiscale vision transformers for classification and detection. In CVPR, 2022.
- Feature pyramid networks for object detection. In CVPR, 2017.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Simple open-vocabulary object detection with vision transformers. In ECCV, 2022.
- Actor-context-actor relation network for spatio-temporal action localization. In CVPR, 2021.
- PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
- Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
- Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
- Hiera: A hierarchical vision transformer without the bells-and-whistles. In ICML, 2023.
- Online real-time multiple spatiotemporal action localisation and prediction. In ICCV, 2017.
- UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- A simple and efficient pipeline to build an end-to-end spatial-temporal action detector. In WACV, 2022.
- Actor-centric relation network. In ECCV, 2018.
- Asynchronous interaction aggregation for action detection. In ECCV, 2020.
- VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems, 2022.
- Actor conditioned attention maps for video action detection. In WACV, 2020.
- Long-term feature banks for detailed video understanding. In CVPR, 2019.
- MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In CVPR, 2022.
- Context-aware RCNN: A baseline for action detection in videos. In ECCV, 2020.
- STMixer: A one-stage sparse action detector. In CVPR, 2023.
- A structured model for action detection. In CVPR, 2019.
- TubeR: Tube-transformer for action detection. arXiv preprint arXiv:2104.00969, 2021.