
Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization (2312.17686v2)

Published 29 Dec 2023 in cs.CV

Abstract: Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution, and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. Single-stage methods, on the other hand, target both tasks by having part of the network (generally the backbone) carry the majority of the shared workload, trading performance for speed. These methods add a DETR head with learnable queries that, after cross- and self-attention, are sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur significant complexity. In this paper, we observe that a straight bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that performs both tasks without the need for an extra encoder-decoder head or learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our Bipartite-Matching Vision Transformer model, BMViT, achieves +3 mAP on AVA2.2 w.r.t. the two-stage MViTv2-S counterpart. Code is available at https://github.com/IoannaNti/BMViT
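To make the core idea concrete, here is a minimal sketch of a DETR-style bipartite matching loss applied directly to a vision transformer's output tokens, in the spirit the abstract describes. It is a simplified illustration, not the paper's implementation: all names (`match_tokens_to_targets`, `bipartite_matching_loss`, the cost weights, the background-class convention) are assumptions, and the classification term is reduced to single-label cross-entropy even though AVA actions are multi-label in practice. Hungarian matching is done with SciPy's `linear_sum_assignment`.

```python
# Hedged sketch: bipartite matching loss on ViT output tokens.
# Assumes the backbone yields N token embeddings, and two small MLP heads
# predict per-token boxes and action logits. Not the paper's exact code.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def match_tokens_to_targets(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """Hungarian matching between N output tokens and M ground-truth persons.

    pred_boxes:  (N, 4) normalized (cx, cy, w, h) from a box MLP
    pred_logits: (N, C) per-token action logits from a classification MLP
    gt_boxes:    (M, 4); gt_labels: (M,) long tensor, with M <= N
    """
    # Matching cost combines a box term (pairwise L1) and a classification
    # term (negative probability of the ground-truth class), as in DETR.
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)        # (N, M)
    cost_cls = -pred_logits.softmax(-1)[:, gt_labels]        # (N, M)
    # The 5:1 weighting is an illustrative choice, not the paper's values.
    cost = (5.0 * cost_box + cost_cls).detach().cpu().numpy()
    token_idx, gt_idx = linear_sum_assignment(cost)
    return torch.as_tensor(token_idx), torch.as_tensor(gt_idx)


def bipartite_matching_loss(pred_boxes, pred_logits, gt_boxes, gt_labels):
    token_idx, gt_idx = match_tokens_to_targets(
        pred_boxes, pred_logits, gt_boxes, gt_labels)
    # Box loss only on matched tokens.
    loss_box = F.l1_loss(pred_boxes[token_idx], gt_boxes[gt_idx])
    # Unmatched tokens are pushed toward a "no person" background class;
    # using the last logit index for background is an assumption here.
    target_cls = torch.full((pred_logits.shape[0],),
                            pred_logits.shape[1] - 1, dtype=torch.long)
    target_cls[token_idx] = gt_labels[gt_idx]
    loss_cls = F.cross_entropy(pred_logits, target_cls)
    return loss_box + loss_cls
```

The point of the construction is visible in the loss alone: because the matching assigns ground-truth persons to ordinary backbone tokens, no learnable queries or encoder-decoder head are needed, which is the efficiency argument the abstract makes against DETR-style single-stage detectors.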
