
Efficient Temporal Action Segmentation via Boundary-aware Query Voting (2405.15995v1)

Published 25 May 2024 in cs.CV

Abstract: Although the performance of Temporal Action Segmentation (TAS) has improved in recent years, achieving promising results often comes with a high computational cost due to dense inputs, complex model structures, and resource-intensive post-processing requirements. To improve efficiency while maintaining performance, we present a novel perspective centered on per-segment classification. By harnessing the capabilities of Transformers, we tokenize each video segment as an instance token, endowed with intrinsic instance segmentation. To realize efficient action segmentation, we introduce BaFormer, a boundary-aware Transformer network. It employs instance queries for instance segmentation and a global query for class-agnostic boundary prediction, yielding continuous segment proposals. During inference, BaFormer employs a simple yet effective voting strategy to classify boundary-wise segments based on instance segmentation. Remarkably, as a single-stage approach, BaFormer significantly reduces computational cost, using only 6% of the running time of the state-of-the-art method DiffAct, while producing better or comparable accuracy on several popular benchmarks. The code for this project is publicly available at https://github.com/peiyao-w/BaFormer.
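The voting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `vote_segments`, the array shapes, and the specific rule (each query votes with the mask mass it places inside a segment, weighted by its class probabilities) are assumptions chosen for illustration. It assumes per-query class probabilities, per-query temporal masks, and a list of predicted boundary frames are already available.

```python
import numpy as np

def vote_segments(class_probs, masks, boundaries, num_frames):
    """Hypothetical query-voting sketch (an assumption, not the paper's code).

    class_probs: (N, C) class probabilities for N instance queries.
    masks:       (N, T) per-query frame masks with values in [0, 1].
    boundaries:  predicted class-agnostic boundary frames (interior points).
    Returns per-frame labels and a list of (start, end, label) segments.
    """
    # Boundaries split the video into contiguous segment proposals.
    interior = sorted(b for b in set(boundaries) if 0 < b < num_frames)
    edges = [0] + interior + [num_frames]

    labels = np.empty(num_frames, dtype=int)
    segments = []
    for start, end in zip(edges[:-1], edges[1:]):
        # Each query votes with the mask mass it places inside the segment.
        query_votes = masks[:, start:end].sum(axis=1)   # shape (N,)
        # Votes are spread over classes by each query's class probabilities.
        class_scores = query_votes @ class_probs        # shape (C,)
        label = int(class_scores.argmax())
        labels[start:end] = label
        segments.append((start, end, label))
    return labels, segments

# Toy example: 2 queries, 3 classes, 6 frames, one boundary at frame 3.
class_probs = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.1, 0.8]])
masks = np.array([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]])
labels, segments = vote_segments(class_probs, masks, [3], 6)
print(labels.tolist())  # [0, 0, 0, 2, 2, 2]
```

In this toy input, frame 2 is claimed by the second query's mask, yet the first segment is still assigned class 0 because the majority of the mask mass inside the segment belongs to the first query. This shows how segment-level voting can smooth over frame-level mask errors without a separate refinement stage.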
