
Learning to Visually Connect Actions and their Effects (2401.10805v3)

Published 19 Jan 2024 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: We introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding. CATE can have applications in areas like task planning and learning from demonstration. We identify and explore two different aspects of the concept of CATE: Action Selection (AS) and Effect-Affinity Assessment (EAA), where video understanding models connect actions and effects at semantic and fine-grained levels, respectively. We design various baseline models for AS and EAA. Despite the intuitive nature of the task, we observe that models struggle, and humans outperform them by a large margin. Our experiments show that in solving AS and EAA, models learn intuitive properties like object tracking and pose encoding without explicit supervision. We demonstrate that CATE can be an effective self-supervised task for learning video representations from unlabeled videos. The study aims to showcase the fundamental nature and versatility of CATE, with the hope of inspiring advanced formulations and models.
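
The abstract frames CATE as a pairing problem between action clips and their effects, and notes it can serve as a self-supervised objective on unlabeled video. As a concrete illustration, here is a minimal sketch assuming a contrastive (InfoNCE-style) formulation in PyTorch; the encoder architecture, the `ClipEncoder` and `cate_contrastive_loss` names, and all hyperparameters are hypothetical choices for illustration, not the paper's actual baselines.

```python
# Minimal sketch (NOT the authors' implementation) of a CATE-style
# self-supervised objective: two encoders map an action clip and its
# effect clip into a shared embedding space, and an InfoNCE-style
# contrastive loss pulls matching (action, effect) pairs together.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipEncoder(nn.Module):
    """Toy 3D-conv encoder for a video clip shaped (B, C, T, H, W)."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x):
        h = self.backbone(x).flatten(1)          # (B, 64) pooled features
        return F.normalize(self.proj(h), dim=1)  # unit-norm embeddings

def cate_contrastive_loss(action_emb, effect_emb, temperature=0.07):
    """InfoNCE over a batch: the i-th action matches the i-th effect."""
    logits = action_emb @ effect_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: action-to-effect and effect-to-action retrieval.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    action_enc, effect_enc = ClipEncoder(), ClipEncoder()
    actions = torch.randn(4, 3, 8, 64, 64)  # batch of action clips
    effects = torch.randn(4, 3, 8, 64, 64)  # matching effect clips
    loss = cate_contrastive_loss(action_enc(actions), effect_enc(effects))
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In such a setup, the positive pairing would come for free from sampling an action segment and its subsequent effect footage within the same video, which is what makes the signal available without labels.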
