
CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities (2312.14556v4)

Published 22 Dec 2023 in cs.CV

Abstract: Following step-by-step procedures is an essential component of various activities carried out by individuals in their daily lives. These procedures serve as a guiding framework that helps to achieve goals efficiently, whether it is assembling furniture or preparing a recipe. However, the complexity and duration of procedural activities inherently increase the likelihood of making errors. Understanding such procedural activities from a sequence of frames is a challenging task that demands an accurate interpretation of visual information and the ability to reason about the structure of the activity. To this end, we collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations, and benchmark the dataset on the following tasks: supervised error recognition, multi-step localization, and procedure learning.
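To make the supervised error recognition task concrete: it can be framed as binary classification over per-step video segments (was this step performed correctly or erroneously?). The sketch below is a minimal, hypothetical illustration of that framing in PyTorch; the class name, feature dimension, and architecture are assumptions for exposition, not the paper's actual benchmark models.

```python
import torch
from torch import nn

class StepErrorClassifier(nn.Module):
    """Hypothetical binary error/no-error classifier over pooled
    per-step video features (e.g. mean-pooled clip embeddings).
    Dimensions and architecture are illustrative assumptions."""

    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit: error vs. correct step
        )

    def forward(self, step_feats: torch.Tensor) -> torch.Tensor:
        # step_feats: (batch, feat_dim), one row per annotated step segment
        return self.net(step_feats).squeeze(-1)

if __name__ == "__main__":
    model = StepErrorClassifier()
    feats = torch.randn(4, 1024)                 # 4 step segments
    labels = torch.tensor([0.0, 1.0, 0.0, 1.0])  # 1 = erroneous step
    logits = model(feats)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()  # standard supervised training step
```

In practice, the per-step features would come from a pretrained video backbone (the paper benchmarks several), but the supervised objective reduces to this kind of per-segment binary classification.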

