
Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions (2403.20254v1)

Published 29 Mar 2024 in cs.CV

Abstract: Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can occasionally be corrupted, for example by missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance, TAD models tend to yield the largest performance drop. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, through the FrameDrop augmentation and Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.
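The abstract names two training components, FrameDrop augmentation and a Temporal-Robust Consistency loss, without detailing them here. The following is a minimal PyTorch sketch of how such a scheme could look; the function names (`frame_drop`, `temporal_robust_consistency_loss`), the drop probability `p`, and the KL-based consistency term are illustrative assumptions, not the authors' actual implementation (see their repository for that).

```python
# Hypothetical sketch of FrameDrop augmentation and a consistency loss,
# assuming a TAD model that maps a clip of frames to per-frame logits.
# All names and parameters are illustrative, not the paper's code.
import torch
import torch.nn.functional as F


def frame_drop(clip: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Randomly zero out frames to simulate missing frames.

    clip: (B, T, C, H, W) batch of video clips.
    p: probability that any given frame is dropped.
    """
    b, t = clip.shape[:2]
    keep = (torch.rand(b, t, device=clip.device) > p).float()
    return clip * keep.view(b, t, 1, 1, 1)


def temporal_robust_consistency_loss(clean_logits, corrupt_logits):
    """Pull predictions on corrupted clips toward clean ones.

    Realized here as a KL divergence between per-frame class
    distributions; the paper's exact formulation may differ.
    """
    return F.kl_div(
        F.log_softmax(corrupt_logits, dim=-1),
        F.softmax(clean_logits.detach(), dim=-1),
        reduction="batchmean",
    )


def training_step(model, clip, targets, det_loss_fn, lam=1.0):
    clean_logits = model(clip)                 # forward on clean clip
    corrupt_logits = model(frame_drop(clip))   # forward on corrupted view
    loss = det_loss_fn(clean_logits, targets)  # standard TAD loss
    loss = loss + lam * temporal_robust_consistency_loss(
        clean_logits, corrupt_logits
    )
    return loss
```

One common design choice in consistency schemes of this kind is to stop gradients through the clean branch (the `detach()` above), so the corrupted view is regularized toward the clean prediction rather than the reverse; whether the paper does this is an assumption here.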
