O-TALC: Steps Towards Combating Oversegmentation within Online Action Segmentation (2404.06894v1)
Abstract: Online temporal action segmentation shows strong potential to facilitate many human-robot interaction (HRI) tasks in which extended human action sequences must be tracked and understood in real time. Traditional action segmentation approaches, however, operate offline in two stages and rely on computationally expensive video-wide features for segmentation, rendering them unsuitable for online HRI applications. To facilitate online action segmentation on a stream of incoming video data, we introduce two methods for improved training and inference of backbone action recognition models, allowing them to be deployed directly for online frame-level classification. First, we introduce surround dense sampling during training to match training clips to inference clips and improve segment boundary predictions. Second, we introduce an Online Temporally Aware Label Cleaning (O-TALC) strategy to explicitly reduce oversegmentation during online inference. As our methods are backbone invariant, they can be deployed with computationally efficient spatio-temporal action recognition models capable of operating in real time with a small segmentation latency. We show that our method outperforms similar online action segmentation work and matches the performance of many offline models with access to full temporal resolution when operating on challenging fine-grained datasets.
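To make the idea of online label cleaning concrete, the following is a minimal sketch of one generic way to suppress oversegmentation on a stream of frame-level predictions. It is not the O-TALC algorithm itself, whose details the abstract does not give. It assumes a simple rule: a change of label is accepted only after the new label has persisted for `min_len` consecutive frames. The function name `clean_labels_online` and the parameter `min_len` are hypothetical and chosen for illustration only.

```python
from typing import Hashable, Iterable, Iterator


def clean_labels_online(
    frame_labels: Iterable[Hashable], min_len: int = 3
) -> Iterator[Hashable]:
    """Yield one cleaned label per incoming frame label.

    Assumption (not taken from the paper): a label change is accepted only
    once the new label has been seen for `min_len` consecutive frames.
    Until then, the last accepted label is repeated.
    """
    started = False    # becomes True once the first frame has been seen
    committed = None   # last accepted label
    candidate = None   # label currently trying to replace `committed`
    run = 0            # consecutive frames on which `candidate` was seen

    for label in frame_labels:
        if not started:
            # First frame: nothing to compare against, accept it directly.
            committed, started = label, True
        elif label == committed:
            # Prediction agrees with the accepted label: drop any candidate.
            candidate, run = None, 0
        else:
            # Prediction disagrees: extend or restart the candidate run.
            if label == candidate:
                run += 1
            else:
                candidate, run = label, 1
            if run >= min_len:
                committed, candidate, run = label, None, 0
        yield committed


# Example: a one-frame "b" spike is removed, while the longer "c" run is kept.
raw = ["a", "a", "b", "a", "a", "c", "c", "c", "c", "a"]
cleaned = list(clean_labels_online(raw, min_len=3))
print(cleaned)
# ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'c', 'c', 'c']
```

Under this assumed rule, each accepted change of segment is reported `min_len - 1` frames late, which illustrates the kind of trade-off between fewer spurious segments and a small segmentation latency that the abstract mentions.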