
Streaming egocentric action anticipation: An evaluation scheme and approach (2306.16682v1)

Published 29 Jun 2023 in cs.CV

Abstract: Egocentric action anticipation aims to predict the future actions the camera wearer will perform from the observation of the past. While predictions about the future should be available before the predicted events take place, most approaches do not pay attention to the computational time required to make such predictions. As a result, current evaluation schemes assume that predictions are available right after the input video is observed, i.e., presuming a negligible runtime, which may lead to overly optimistic evaluations. We propose a streaming egocentric action evaluation scheme which assumes that predictions are performed online and made available only after the model has processed the current input segment, which depends on its runtime. To evaluate all models considering the same prediction horizon, we hence propose that slower models should base their predictions on temporal segments sampled ahead of time. Based on the observation that model runtime can affect performance in the considered streaming evaluation scenario, we further propose a lightweight action anticipation model based on feed-forward 3D CNNs which is optimized using knowledge distillation techniques with a novel past-to-future distillation loss. Experiments on the three popular datasets EPIC-KITCHENS-55, EPIC-KITCHENS-100 and EGTEA Gaze+ show that (i) the proposed evaluation scheme induces a different ranking on state-of-the-art methods as compared to classic evaluations, (ii) lightweight approaches tend to outmatch more computationally expensive ones, and (iii) the proposed model based on feed-forward 3D CNNs and knowledge distillation outperforms current art in the streaming egocentric action anticipation scenario.
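The core of the proposed evaluation scheme is a timing constraint: for a prediction to be available a given anticipation horizon before an action starts, a model with non-zero runtime must base it on video observed earlier. The following sketch illustrates that constraint; the function and variable names are our own and not from the paper.

```python
def observation_end(action_start, anticipation_horizon, model_runtime):
    """Latest time (in seconds) up to which a model may observe video so that
    its prediction is ready `anticipation_horizon` seconds before the action."""
    # The prediction must be available this long before the action begins.
    prediction_deadline = action_start - anticipation_horizon
    # In the streaming scheme, processing takes `model_runtime` seconds, so the
    # observed segment must end that much earlier than the deadline.
    return prediction_deadline - model_runtime

# Classic evaluation implicitly assumes zero runtime:
print(observation_end(10.0, 1.0, 0.0))  # observes up to t = 9.0 s
# A slower model (0.5 s runtime) must predict from an earlier segment:
print(observation_end(10.0, 1.0, 0.5))  # observes up to t = 8.5 s
```

This makes explicit why, under streaming evaluation, a lightweight model observing more recent video can outperform a heavier but more accurate one.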

Authors (2)
  1. Antonino Furnari (46 papers)
  2. Giovanni Maria Farinella (50 papers)
Citations (3)
