IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting (2403.11959v2)

Published 18 Mar 2024 in cs.CV, cs.AI, and cs.MM

Abstract: Video Action Counting (VAC) is crucial in analyzing sports, fitness, and everyday activities by quantifying repetitive actions in videos. However, traditional VAC methods have overlooked the complexity of action repetitions, such as interruptions and variability in cycle duration. Our research addresses this shortfall by introducing a novel approach to VAC, called Irregular Video Action Counting (IVAC). IVAC prioritizes modeling irregular repetition patterns in videos, which we define through two primary aspects: Inter-cycle Consistency and Cycle-interval Inconsistency. Inter-cycle Consistency ensures homogeneity in the spatial-temporal representations of cycle segments, signifying action uniformity within cycles. Cycle-interval Inconsistency highlights the importance of distinguishing between cycle segments and intervals based on their inherent content differences. To encapsulate these principles, we propose a new methodology that includes consistency and inconsistency modules, supported by a unique pull-push loss (P2L) mechanism. The IVAC-P2L model applies a pull loss to promote coherence among cycle segment features and a push loss to clearly distinguish features of cycle segments from those of interval segments. Empirical evaluations conducted on the RepCount dataset demonstrate that the IVAC-P2L model sets a new benchmark in VAC task performance. Furthermore, the model demonstrates exceptional adaptability and generalization across various video contents, outperforming existing models on two additional datasets, UCFRep and Countix, without the need for dataset-specific optimization. These results confirm the efficacy of our approach in addressing irregular repetitions in videos and pave the way for further advancements in video analysis and understanding.
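The pull-push idea in the abstract can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration, not the paper's exact formulation: it assumes L2-normalized per-segment features, implements the pull term as the mean squared distance of cycle-segment features to their centroid, and the push term as a standard margin-based hinge between cycle and interval features. The function name, the margin value, and both term definitions are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def pull_push_loss(cycle_feats: torch.Tensor,
                   interval_feats: torch.Tensor,
                   margin: float = 1.0) -> torch.Tensor:
    """Illustrative pull-push (P2L-style) objective.

    cycle_feats:    (N, D) features of cycle segments.
    interval_feats: (M, D) features of interval segments.
    NOTE: this sketches the general contrastive pattern; the exact
    loss in IVAC-P2L may be formulated differently.
    """
    cycle_feats = F.normalize(cycle_feats, dim=1)
    interval_feats = F.normalize(interval_feats, dim=1)

    # Pull: draw cycle-segment features toward their centroid,
    # encouraging inter-cycle consistency.
    center = cycle_feats.mean(dim=0, keepdim=True)          # (1, D)
    pull = (cycle_feats - center).pow(2).sum(dim=1).mean()

    # Push: penalize interval features that sit closer than
    # `margin` to any cycle feature (cycle-interval inconsistency).
    dists = torch.cdist(interval_feats, cycle_feats)        # (M, N)
    push = F.relu(margin - dists).pow(2).mean()

    return pull + push
```

In training, cycle and interval segments would presumably be delimited by the ground-truth cycle annotations, with this auxiliary loss added to the counting objective.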
