
O-TALC: Steps Towards Combating Oversegmentation within Online Action Segmentation (2404.06894v1)

Published 10 Apr 2024 in cs.CV

Abstract: Online temporal action segmentation shows strong potential to facilitate many human-robot interaction (HRI) tasks where extended human action sequences must be tracked and understood in real time. Traditional action segmentation approaches, however, operate in an offline, two-stage manner, relying on computationally expensive video-wide features for segmentation, which renders them unsuitable for online HRI applications. To facilitate online action segmentation on a stream of incoming video data, we introduce two methods for improved training and inference of backbone action recognition models, allowing them to be deployed directly for online frame-level classification. Firstly, we introduce surround dense sampling during training so that training clips match those seen at inference and segment boundary predictions improve. Secondly, we introduce an Online Temporally Aware Label Cleaning (O-TALC) strategy to explicitly reduce oversegmentation during online inference. As our methods are backbone invariant, they can be deployed with computationally efficient spatio-temporal action recognition models capable of operating in real time with a small segmentation latency. We show that our method outperforms similar online action segmentation work and matches the performance of many offline models with access to full temporal resolution when operating on challenging fine-grained datasets.
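
The abstract does not spell out how O-TALC cleans labels, so the following is only an illustrative sketch of the general idea of reducing oversegmentation in a stream of frame-level predictions. It assumes a simple persistence rule: a new label is committed only after it has been predicted for `min_len` consecutive frames. The rule, the function names `clean_labels_online` and `count_segments`, and the `min_len` parameter are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Hashable, Iterable, List, Optional


def clean_labels_online(frame_preds: Iterable[Hashable], min_len: int = 3) -> List[Hashable]:
    """Hypothetical online label cleaner (not the paper's O-TALC).

    Emits one label per incoming frame. A label change is accepted only
    after the new label has persisted for `min_len` consecutive frames,
    which suppresses short spurious segments at the cost of a small delay.
    """
    committed: Optional[Hashable] = None  # label currently being emitted
    candidate: Optional[Hashable] = None  # label competing to replace it
    run = 0                               # consecutive frames seen for candidate
    cleaned: List[Hashable] = []

    for label in frame_preds:
        if committed is None:
            committed = label             # first frame: accept immediately
        elif label == committed:
            candidate, run = None, 0      # prediction agrees: drop any candidate
        else:
            if label == candidate:
                run += 1
            else:
                candidate, run = label, 1
            if run >= min_len:            # candidate persisted long enough
                committed, candidate, run = candidate, None, 0
        cleaned.append(committed)
    return cleaned


def count_segments(labels: List[Hashable]) -> int:
    """Number of maximal runs of identical labels."""
    return sum(1 for i, lab in enumerate(labels) if i == 0 or lab != labels[i - 1])


raw = ["a", "a", "b", "a", "a", "b", "b", "b", "b", "a"]
cleaned = clean_labels_online(raw, min_len=3)
print(count_segments(raw), count_segments(cleaned))
```

In this sketch the single-frame "b" at index 2 and the trailing "a" are suppressed, while the sustained run of "b" is accepted after a delay of `min_len - 1` frames. That delay is the kind of trade-off the abstract refers to as a small segmentation latency.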
