PREGO: online mistake detection in PRocedural EGOcentric videos (2404.01933v2)
Abstract: Promptly identifying procedural errors from egocentric videos in an online setting is highly challenging and valuable for detecting mistakes as soon as they happen. This capability has a wide range of applications across various fields, such as manufacturing and healthcare. The nature of procedural mistakes is open-set, since novel types of failures might occur, which calls for one-class classifiers trained on correctly executed procedures. However, no technique can currently detect open-set procedural mistakes online. We propose PREGO, the first online one-class classification model for mistake detection in PRocedural EGOcentric videos. PREGO is based on an online action-recognition component that models the current action and a symbolic reasoning module that predicts the next actions. Mistake detection is performed by comparing the recognized current action with the expected future one. We evaluate PREGO on two procedural egocentric video datasets, Assembly101 and Epic-tent, which we adapt for the online benchmarking of procedural mistake detection, thereby defining the Assembly101-O and Epic-tent-O datasets, respectively.
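The detection rule described in the abstract (flag a mistake whenever the action recognized online diverges from the step anticipated from the history of correct steps) can be sketched compactly. The Python below is a minimal illustration of that comparison logic only; the `recognize` and `anticipate` callables are hypothetical stand-ins for the paper's recognition and symbolic-reasoning branches, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class OnlineMistakeDetector:
    """Sketch of PREGO-style open-set, one-class mistake detection:
    flag any deviation from the anticipated correct continuation."""
    recognize: Callable[[object], str]      # video segment -> current step label
    anticipate: Callable[[List[str]], str]  # past step labels -> expected next step
    history: List[str] = field(default_factory=list)

    def step(self, segment: object) -> bool:
        """Process one incoming segment online; return True if a mistake is flagged."""
        current = self.recognize(segment)         # what the wearer is doing now
        expected = self.anticipate(self.history)  # what a correct execution predicts next
        self.history.append(current)
        # The anticipator is trained/prompted only on correct executions, so it
        # never predicts errors; any mismatch is treated as a procedural mistake.
        return current != expected
```

Because the anticipator only ever models correct procedures, novel (open-set) failure modes are caught by the mismatch itself rather than by an explicit error taxonomy.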
- GePSAn: Generative procedure step anticipation in cooking videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2988–2997, 2023.
- MiniROAD: Minimal RNN framework for online action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10307–10316, 2023.
- Video-mined task graphs for keystep recognition in instructional videos. arXiv preprint arXiv:2307.08763, 2023.
- The IKEA ASM dataset: Understanding people assembling furniture through actions, objects and pose. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 847–859, 2021.
- InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2023.
- Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
- Every mistake counts in assembly. arXiv preprint arXiv:2307.16453, 2023.
- Unsupervised procedure learning via joint dynamic summarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- Language models can be logical solvers. arXiv preprint arXiv:2311.06158, 2023.
- Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10318–10329, 2023.
- Weakly-supervised action segmentation and unseen error detection in anomalous instructional videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10128–10138, 2023.
- Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14953–14962, 2023.
- EPIC-Tent: An egocentric video dataset for camping tent assembly. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
- Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500, 2023.
- Set-supervised action learning in procedural task videos via pairwise order consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19903–19913, 2022.
- HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- Large language models as general pattern machines. In Proceedings of the 7th Conference on Robot Learning (CoRL), 2023.
- GPT3-to-Plan: Extracting plans from text using GPT-3. In ICAPS Workshop on Planning for Financial Services (FinPlan), page 24, 2021.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Plansformer: Generating symbolic plans using transformers. arXiv preprint arXiv:2212.08681, 2022.
- An outlook into the future of egocentric vision. arXiv preprint arXiv:2308.07123, 2023.
- Self-regulated learning for egocentric video activity anticipation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):6715–6730, 2023.
- SVIP: Sequence verification for procedures in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19890–19902, 2022.
- The MECCANO dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1569–1578, 2021.
- ENIGMA-51: Towards a fine-grained understanding of human-object interactions in industrial scenarios. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.
- IndustReal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4365–4374, 2024.
- Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- STEPS: Self-supervised key step extraction and localization from unlabeled procedural videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10375–10387, 2023.
- Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), 2013.
- ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1207–1216, 2019.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- OadTR: Online action detection with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7565–7575, 2021.
- HoloAssist: An egocentric human interaction dataset for interactive AI assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20270–20281, 2023.
- Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
- Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.
- Generative cooperative learning for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14744–14754, 2022.
- Learning procedure-aware video representation from instructional videos and their narrations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.