
Adapting Short-Term Transformers for Action Detection in Untrimmed Videos (2312.01897v2)

Published 4 Dec 2023 in cs.CV

Abstract: Vision Transformer (ViT) has shown high potential in video recognition, owing to its flexible design, adaptable self-attention mechanisms, and the efficacy of masked pre-training. Yet, it remains unclear how to adapt these pre-trained short-term ViTs for temporal action detection (TAD) in untrimmed videos. Existing works treat them as off-the-shelf feature extractors for each short trimmed snippet, without capturing the fine-grained relations among different snippets in a broader temporal context. To mitigate this issue, this paper focuses on designing a new mechanism for adapting these pre-trained ViT models as a unified long-form video transformer, to fully unleash their modeling power in capturing inter-snippet relations while still keeping low computation overhead and memory consumption for efficient TAD. To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets at two levels. For inner-backbone information propagation, we introduce a cross-snippet propagation strategy to enable multi-snippet temporal feature interaction inside the backbone. For post-backbone information propagation, we propose temporal transformer layers for further clip-level modeling. With a plain ViT-B pre-trained with VideoMAE, our end-to-end temporal action detector (ViT-TAD) achieves very competitive performance compared to previous temporal action detectors, reaching up to 69.5 average mAP on THUMOS14, 37.40 average mAP on ActivityNet-1.3, and 17.20 average mAP on FineAction.
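The two-level design described in the abstract can be made concrete with a short sketch. Below is a minimal PyTorch illustration of the inner-backbone level, where a cross-snippet propagation module lets features from different snippets interact along the temporal (snippet) axis between blocks of a short-term ViT. The class name `CrossSnippetPropagation`, the attention-over-snippets design, and all shapes are illustrative assumptions, not the paper's released implementation.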
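```python
# Hypothetical sketch of inner-backbone cross-snippet propagation, assuming
# snippet features of shape (num_snippets, tokens_per_snippet, dim) produced
# by a short-term ViT. Names and shapes are illustrative, not the paper's code.
import torch
import torch.nn as nn

class CrossSnippetPropagation(nn.Module):
    """Exchanges temporal information among snippets by attending across the
    snippet axis for each token position (an illustrative design)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (S, N, D) = (snippets, tokens per snippet, channels)
        y = x.permute(1, 0, 2)       # (N, S, D): each token position is a batch
        y = self.norm(y)
        y, _ = self.attn(y, y, y)    # attention runs only along the snippet axis
        y = y.permute(1, 0, 2)       # back to (S, N, D)
        return x + y                 # residual keeps the short-term features

if __name__ == "__main__":
    feats = torch.randn(8, 196, 768)            # 8 snippets, 14x14 tokens, ViT-B width
    out = CrossSnippetPropagation(768)(feats)
    print(out.shape)                            # torch.Size([8, 196, 768])
```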
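The post-backbone level can be sketched in the same hedged spirit: after the backbone, each snippet is pooled to a single clip-level vector, and standard transformer encoder layers model the full sequence of snippets. The layer count, width, and pooling assumption below are placeholders rather than the paper's reported configuration.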
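```python
# A minimal sketch of post-backbone temporal transformer layers, assuming each
# snippet has already been pooled to one clip-level feature vector. Hyper-
# parameters here are assumptions, not values taken from the paper.
import torch
import torch.nn as nn

class PostBackboneTemporalTransformer(nn.Module):
    """Stacks standard transformer encoder layers over the sequence of
    clip-level snippet features for further long-range temporal modeling."""

    def __init__(self, dim: int = 768, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_snippets, dim) after spatial pooling
        return self.encoder(clip_feats)

if __name__ == "__main__":
    clips = torch.randn(1, 64, 768)  # one untrimmed video as 64 snippet features
    print(PostBackboneTemporalTransformer()(clips).shape)  # (1, 64, 768)
```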

References (45)
  1. Diagnosing error in temporal action detectors. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part III, volume 11207 of Lecture Notes in Computer Science, pages 264–280. Springer, 2018.
  2. Vivit: A video vision transformer. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 6816–6826. IEEE, 2021.
  3. Boundary content graph neural network for temporal action proposal generation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, ECCV, volume 12373, pages 121–137, 2020.
  4. Is space-time attention all you need for video understanding? In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 813–824. PMLR, 2021.
  5. Space-time mixing attention for video transformer. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 19594–19607, 2021.
  6. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I, volume 12346 of Lecture Notes in Computer Science, pages 213–229. Springer, 2020.
  7. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pages 4724–4733, 2017.
  8. Faster-tad: Towards temporal action detection with proposal generation and classification in a unified network. CoRR, abs/2204.02674, 2022.
  9. Tallformer: Temporal action localization with a long-memory transformer. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIV, volume 13694 of Lecture Notes in Computer Science, pages 503–521. Springer, 2022.
  10. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  11. Multiscale vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 6804–6815. IEEE, 2021.
  12. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015.
  13. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
  14. Exploring plain vision transformer backbones for object detection. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IX, volume 13669 of Lecture Notes in Computer Science, pages 280–296. Springer, 2022.
  15. Fast learning of temporal action proposal via dense boundary generator. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11499–11506, 2020.
  16. Learning salient boundary feature for anchor-free temporal action localization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 3320–3329. Computer Vision Foundation / IEEE, 2021.
  17. BMN: boundary-matching network for temporal action proposal generation. In ICCV, pages 3888–3897, 2019.
  18. BSN: boundary sensitive network for temporal action proposal generation. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, ECCV, volume 11208, pages 3–21, 2018.
  19. Progressive boundary refinement network for temporal action detection. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 11612–11619. AAAI Press, 2020.
  20. An empirical study of end-to-end temporal action detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 19978–19987. IEEE, 2022.
  21. Fineaction: A fine-grained video dataset for temporal action localization. CoRR, abs/2105.11107, 2021.
  22. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 9992–10002. IEEE, 2021.
  23. Video swin transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 2022.
  24. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  25. Video transformer network. CoRR, abs/2102.00719, 2021.
  26. Rethinking video vits: Sparse video tubes for joint image and video learning. CoRR, abs/2212.03229, 2022.
  27. Grad-cam: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis., 128(2):336–359, 2020.
  28. An image is worth 16x16 words, what is a video worth? CoRR, abs/2103.13915, 2021.
  29. Tridet: Temporal action detection with relative boundary modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18857–18866, 2023.
  30. React: Temporal action detection with relational queries. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part X, volume 13670 of Lecture Notes in Computer Science, pages 105–121. Springer, 2022.
  31. Relaxed transformer decoders for direct action proposal generation. In ICCV, pages 13526–13535, 2021.
  32. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. CoRR, abs/2203.12602, 2022.
  33. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 2021.
  34. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
  35. RGB stream is enough for temporal action detection. CoRR, abs/2107.04362, 2021.
  36. Videomae V2: scaling video masked autoencoders with dual masking. CoRR, abs/2303.16727, 2023.
  37. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36, 2016.
  38. Internvideo: General video foundation models via generative and discriminative learning. CoRR, abs/2212.03191, 2022.
  39. An efficient spatio-temporal pyramid transformer for action detection. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIV, volume 13694 of Lecture Notes in Computer Science, pages 358–375. Springer, 2022.
  40. CUHK & ETHZ & SIAT submission to activitynet challenge 2016. CoRR, abs/1608.00797, 2016.
  41. G-TAD: sub-graph localization for temporal action detection. In CVPR, pages 10153–10162. Computer Vision Foundation / IEEE, 2020.
  42. Basictad: an astounding rgb-only baseline for temporal action detection. CoRR, abs/2205.02717, 2022.
  43. Actionformer: Localizing moments of actions with transformers. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IV, volume 13664 of Lecture Notes in Computer Science, pages 492–510. Springer, 2022.
  44. Re^2tal: Rewiring pretrained video backbones for reversible temporal action localization. CoRR, abs/2211.14053, 2022.
  45. Video self-stitching graph network for temporal action localization. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 13638–13647. IEEE, 2021.
Authors (4)
  1. Min Yang (239 papers)
  2. Huan Gao (14 papers)
  3. Ping Guo (38 papers)
  4. Limin Wang (221 papers)
Citations (2)
