
Action Detection via an Image Diffusion Process

Published 1 Apr 2024 in cs.CV | (2404.01051v1)

Abstract: Action detection aims to localize the starting and ending points of action instances in untrimmed videos and to predict the classes of those instances. In this paper, we observe that the outputs of the action detection task can be formulated as images. From this perspective, we tackle action detection as a three-image generation process that produces starting-point, ending-point, and action-class predictions as images via our proposed Action Detection Image Diffusion (ADI-Diff) framework. Because these images differ from natural images and exhibit special properties, we also explore a Discrete Action-Detection Diffusion Process and a Row-Column Transformer design to better handle them. Our ADI-Diff framework achieves state-of-the-art results on two widely used datasets.
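
To make the "outputs as images" idea concrete, here is a minimal sketch of one way detection outputs could be written as three binary images. The abstract does not specify the encoding, so everything here is an assumption for illustration only: the function names `encode_actions` and `decode_actions` are hypothetical, each image row is assumed to correspond to one temporal position, and action instances are assumed not to overlap. It is not the ADI-Diff formulation itself.

```python
def encode_actions(instances, num_frames, num_classes):
    """Encode (start, end, class_id) instances as three binary images.

    Assumed layout (illustrative, not taken from the paper):
      start_img[t][s] = 1 if frame t lies in an action that starts at s
      end_img[t][e]   = 1 if frame t lies in an action that ends at e
      class_img[t][c] = 1 if frame t lies in an action of class c
    Rows of background frames stay all-zero.
    """
    start_img = [[0] * num_frames for _ in range(num_frames)]
    end_img = [[0] * num_frames for _ in range(num_frames)]
    class_img = [[0] * num_classes for _ in range(num_frames)]
    for start, end, class_id in instances:
        for t in range(start, end + 1):  # end is inclusive
            start_img[t][start] = 1
            end_img[t][end] = 1
            class_img[t][class_id] = 1
    return start_img, end_img, class_img


def decode_actions(start_img, end_img, class_img):
    """Recover sorted (start, end, class_id) tuples from the three images."""
    instances = set()
    for s_row, e_row, c_row in zip(start_img, end_img, class_img):
        if 1 not in c_row:
            continue  # background frame: no action covers this position
        instances.add((s_row.index(1), e_row.index(1), c_row.index(1)))
    return sorted(instances)


# Two hypothetical actions in an 8-frame video with 3 classes.
actions = [(1, 3, 0), (5, 6, 2)]
start_img, end_img, class_img = encode_actions(actions, num_frames=8, num_classes=3)
recovered = decode_actions(start_img, end_img, class_img)
```

Under this assumed layout each non-background row is a one-hot vector, which is one way such images could differ from natural images: the pixel values are discrete and the rows and columns carry distinct meanings.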

