CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities (2312.14556v4)
Abstract: Following step-by-step procedures is an essential component of various activities carried out by individuals in their daily lives. These procedures serve as a guiding framework that helps to achieve goals efficiently, whether it is assembling furniture or preparing a recipe. However, the complexity and duration of procedural activities inherently increase the likelihood of making errors. Understanding such procedural activities from a sequence of frames is a challenging task that demands an accurate interpretation of visual information and the ability to reason about the structure of the activity. To this end, we collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations, and benchmark the dataset on the following tasks: supervised error recognition, multi-step localization, and procedure learning.
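To make the annotation counts concrete, here is a minimal sketch of what a step-level annotation record and an error-recognition label might look like; the field names and structure below are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical step-annotation record. Field names are illustrative
# assumptions for this sketch, not CaptainCook4D's actual schema.
@dataclass
class StepAnnotation:
    recording_id: str       # which of the 384 recordings this step belongs to
    step_description: str   # e.g. "Chop the onion"
    start_time: float       # step start, in seconds from the recording start
    end_time: float         # step end, in seconds
    has_error: bool         # whether the participant deviated from the recipe

def total_annotated_seconds(steps):
    """Sum the duration covered by a list of step annotations."""
    return sum(s.end_time - s.start_time for s in steps)

steps = [
    StepAnnotation("rec_001", "Chop the onion", 4.0, 12.5, False),
    StepAnnotation("rec_001", "Add salt", 13.0, 15.0, True),
]
print(total_annotated_seconds(steps))  # 10.5
```

Under this kind of schema, supervised error recognition reduces to predicting `has_error` per step from the video segment `[start_time, end_time]`, while multi-step localization predicts the time spans themselves.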