Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment (2403.19225v1)
Abstract: Weakly-supervised action segmentation is the task of learning to partition a long video into several action segments, where training videos are accompanied only by transcripts (ordered lists of actions). Most existing methods must infer a pseudo segmentation for training through serial alignment between all frames and the transcript, which is time-consuming and hard to parallelize during training. In this work, we aim to escape this inefficient alignment over massive but redundant frames and instead directly localize a few action transitions to generate the pseudo segmentation, where a transition is the change from one action segment to the next adjacent one in the transcript. Because the true transitions are submerged among noisy boundaries caused by intra-segment visual variation, we propose a novel Action-Transition-Aware Boundary Alignment (ATBA) framework that efficiently and effectively filters out noisy boundaries and detects transitions. In addition, to boost semantic learning when noise is inevitably present in the pseudo segmentation, we introduce video-level losses that exploit the trusted video-level supervision. Extensive experiments demonstrate the effectiveness of our approach in both performance and training speed.
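To make the idea of transition-based pseudo segmentation concrete: once the frame positions of the N-1 transitions of an N-action transcript are known, the per-frame pseudo labels follow directly. The sketch below illustrates only this final expansion step, not the ATBA procedure for detecting transitions. The function name `pseudo_segmentation`, the convention that each transition index is the first frame of the next action, and the toy action names are assumptions made for illustration; none of them come from the paper.

```python
from typing import List, Sequence


def pseudo_segmentation(
    transcript: Sequence[str],
    transitions: Sequence[int],
    num_frames: int,
) -> List[str]:
    """Expand a transcript into per-frame labels from transition positions.

    Assumed convention (hypothetical): transitions[i] is the index of the
    first frame that belongs to transcript[i + 1].
    """
    if len(transitions) != len(transcript) - 1:
        raise ValueError("need exactly len(transcript) - 1 transitions")

    # Segment boundaries: start of video, each transition, end of video.
    bounds = [0, *transitions, num_frames]
    if any(start >= end for start, end in zip(bounds, bounds[1:])):
        raise ValueError("transitions must be strictly increasing and inside the video")

    labels: List[str] = []
    for action, start, end in zip(transcript, bounds, bounds[1:]):
        # Every frame in [start, end) receives the same action label.
        labels.extend([action] * (end - start))
    return labels


# Toy example: a 10-frame video with three actions and two transitions.
labels = pseudo_segmentation(["pour_milk", "stir", "drink"], [4, 7], 10)
print(labels)
```

In this toy setting only two integers have to be estimated for a 10-frame video, which is the intuition behind localizing a few transitions instead of aligning every frame to the transcript.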
Authors: Angchi Xu, Wei-Shi Zheng